Executive Summary

This report documents the development and validation of the NELS:88 cognitive test battery. The 
cognitive test battery assesses longitudinal growth between grades 8 and 12 in four content areas - reading 
comprehension, mathematics, science and history/citizenship/geography. The cognitive battery was part 
of the larger National Education Longitudinal Study of 1988 that was monitored by the Longitudinal and 
Household Studies Branch (LHSB) of the National Center for Education Statistics (NCES). The NELS:88 
test battery was administered to a representative sample of 8th graders in the spring of 1988, who were 
then retested in the spring of 1990 and 1992. Response rates ranged from 93 to 96 percent for the in-school
8th and 10th graders and dropped to about 81 percent for the twelfth graders. There was some
tendency for students from low socio-economic backgrounds to be over-represented among the non- 
respondents. 

In order to minimize floor and ceiling effects which typically distort gain scores, special 
procedures were designed into the development and administration of the cognitive test battery. The test 
battery used a two-stage multilevel procedure that attempted to tailor the difficulty of the test items to the 
performance level of a particular student. For example, students who performed very well on their 8th
grade mathematics test received a relatively more difficult form in tenth grade than those scoring in the 
middle or in the lower range on their 8th grade test. There were three forms varying in difficulty in 
mathematics and two in the reading area in both grades 10 and 12. Since tenth and twelfth graders were 
taking forms that were more appropriate for their level of ability/achievement, measurement accuracy was 
enhanced and floor and ceiling effects could be minimized. The remaining two content areas, science and 
history/citizenship/geography, were designed only to be grade level adaptive, i.e., to have a different form for
each grade, and therefore did not have multiple forms varying in difficulty within grade. 

In order to maximize the gain from using an adaptive procedure, special vertical scaling 
procedures were used that allow for Bayesian priors on subpopulations for both item parameters and scale 
scores. This report documents the test specifications for the multilevel forms as well as the Bayesian
procedures used in the vertical scaling. The report also includes a comparison of more traditional non- 
Bayesian approaches to scaling longitudinal measures with the Bayesian approach. 

It was found that the multilevel approach did increase the accuracy of the measurement, and when 
used in combination with the Bayesian item parameter estimation, reduced floor and ceiling effects when 
compared to the more traditional item response theory approaches. 
Chapter 1
Introduction 

The National Education Longitudinal Study of 1988 (NELS:88) is designed to monitor the 
transition of a national sample of young adults as they progress from eighth grade to high school and then 
on to postsecondary education and/or the world of work. The NELS:88 surveys are monitored by the 
Longitudinal and Household Studies Branch (LHSB) of the National Center for Education Statistics 
(NCES). NELS:88 is the third and most recent in a series of longitudinal studies that are designed to 
provide timely information on trends in academic achievement. The two earlier longitudinal studies 
sponsored by NCES were the National Longitudinal Study of the high school class of 1972 (NLS-72) and
the High School and Beyond (HS&B) study of 1980. 

The primary purpose of the NELS:88 data collection is to provide policy relevant information 
concerning the effectiveness of schools, curriculum paths, special programs, variations in curriculum 
content and exposure, and/or mode of delivery in bringing about educational growth. In addition to the 
test scores described in this report, the NELS:88 database contains a great deal of data on factors relevant 
to cognitive growth, including student questionnaires with information on family background, aspirations 
and attitudes and experiences in and out of school; high school transcripts; and teacher, school and parent 
questionnaires. The sample was designed to provide sufficient numbers of students in "high risk" 
subpopulations to allow for separate analysis of the growth patterns for these critical subgroups. Given 
the ambitious educational achievement goals that are being set for the year 2000, it is critical that we 
gather evidence now on how variations in student characteristics interact with variations in the content and 
processes of educational programs in bringing about cognitive growth. 

The purpose of this report is to document the rationale and technical decisions that were carried 
out in the design, development and scaling of the cognitive battery. 



Sample and Completion Rates 

While the base year (1988) participating sample was 24,599, a subsample was selected for follow- 
up in the subsequent years, with varying probabilities depending on how they clustered in schools. Panel 
test data were obtained on approximately 12,000 core sample individuals who had useable cognitive test 
data on all three (1988, 1990, 1992) occasions. In addition to the core panel sample individuals, there 
were augmented state and other special samples at the base year and succeeding follow-ups. Freshened 
samples were also added at the first and second follow-ups to ensure a representative sample of students
within a grade. Additional details about the sample design and survey procedures may be found in the 
second follow-up user's manual (Ingels et al., 1994). Table 1.1 below presents the test completion rates 
for selected subpopulations for individuals in the core panel sample only. 

Inspection of Table 1.1 indicates that approximately two thirds of the total target sample have all 
four cognitive scores on all three occasions. Much of the analysis in this psychometric report will be 
based on this panel sample. Cross-sectional (within-year) analyses that do not require data at all three 
time points will include students who were in the NELS:88 core sample but were not tested at all three 
points in time; other statistics that are internal to the tests themselves and do not make reference to 
national estimates may include the state augmentation samples that were not part of the NELS:88 core. 
These less stringent criteria lead to significantly greater participation rates than those shown in Table 1.1. 
More detailed discussions about non-response rates are presented in the section on motivation. A detailed
discussion of sample selection and weighting procedures may be found in Ingels et al. (1994). 
Table 1.1

Proportion of the Core Panel Sample Participants with
All Four Cognitive Tests On All Three Occasions

                         Eligible Core             Percentages With All
                         Panel Sample              Tests On All Occasions

                       RAW N        WTD N          % RAW N      % WTD N

Total               [illegible]  [illegible]     [illegible]  [illegible]
Male                    8140      1492789            69           66
Female                  8349      1478047            70           65
Asian                    995       105878            69           66
Hispanic                2017       307485            61           58
Black                   1628       390455            63           52
White                  11662      2122702            72           69
Public School*         12585      2253702            74           72
Catholic School*         850       149699            79           75
NAIS Private*            930        32107            73           74

* The classification by school type only includes those individuals who were enrolled in school. The remaining classifications,
gender and race, include all students whether they are enrolled or not.

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for
Education Statistics.
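
The difference between the raw and weighted percentages reported in Table 1.1 can be illustrated with a short computation. The sketch below is illustrative only: the function name `completion_rates` and the four sample records are hypothetical, and the actual NELS:88 weighting procedures are those documented in Ingels et al. (1994).

```python
def completion_rates(records):
    """Return (raw percent, weighted percent) of cases with complete test data.

    records is a list of (sampling_weight, completed) pairs; both the
    layout and the values used below are hypothetical.
    """
    raw_n = len(records)
    raw_complete = sum(1 for _, done in records if done)
    wtd_n = sum(weight for weight, _ in records)
    wtd_complete = sum(weight for weight, done in records if done)
    return 100.0 * raw_complete / raw_n, 100.0 * wtd_complete / wtd_n


# Three completers with small weights, one non-completer with a large weight.
sample = [(100.0, True), (100.0, True), (100.0, True), (300.0, False)]
raw_pct, wtd_pct = completion_rates(sample)
```

When non-completers carry larger sampling weights than completers, the weighted percentage falls below the raw percentage, which is the pattern seen in most rows of Table 1.1.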
Chapter 2
NELS Test Specifications 

This chapter will discuss the special considerations in testing a national sample of students in 
several subject areas over a four-year time span. The rationale for the design of multiple overlapping test 
forms is described, as well as the considerations in choosing the timing and content of each form. 



Aims and Objectives 

The test specifications of the NELS:88 longitudinal test battery are dictated by its primary
purpose: accurate measurement of the status of individuals at a given point in time, as well as their 
growth over time. Like its predecessor, the 1980 High School and Beyond (HS&B) test battery, the 
National Education Longitudinal Study (NELS:88) test battery was developed to measure both individual 
status and growth in a number of achievement areas. The four achievement areas are Reading 
Comprehension, Mathematics, Science, and History/Citizenship/Geography (H/C/G). However, unlike the
HS&B assessment, which was designed only to measure growth between the tenth and twelfth grades, the
NELS:88 battery is designed to measure growth in achievement between the eighth, tenth and twelfth
grades. Since the NELS:88 assessment spans four years with repeated testing of the same student cohort 
in the eighth, tenth and twelfth grades, it calls for a more flexible testing approach than was required in 
the HS&B longitudinal assessment. 

The construction of the NELS:88 eighth grade battery is in some sense a delicate balancing act 
between several competing objectives. Many of these objectives were suggested by the NELS Technical 
Review Panel (TRP) and/or NCES project staff during the base year development. Some of these 
objectives were as follows: 

• The NELS:88 test battery should cover four content areas - Reading, Mathematics, Science,
and History/Citizenship/Geography.

• Item selection should be curriculum-relevant, with emphasis on concepts, skills and general
principles. When measuring change or developmental growth, the overemphasis on isolated
facts at the expense of conceptual and/or problem-solving skills may lead to distortions in the
gain scores due to forgetting. More will be said about this later.

• The tests should be relatively unspeeded with the vast majority of students completing all
tests.

• There should be little evidence of floor or ceiling effects.

• Reliabilities of the component tests should be psychometrically acceptable for the purpose of
measuring individual status as well as growth. While much of the analysis using the NELS
database will probably be at the group level, there will be many studies that use the test
scores as covariates. In such cases the reliability of the covariates becomes important. Also,
when measuring change we need evidence that we are measuring the same things over time.
• The accuracy of measurement, i.e., the standard error of measurement, should be relatively
constant across SES, sex and racial/ethnic groups. In fact, the NELS:88 battery was 
specifically designed to reduce the gap in reliabilities that is typically found between the 
majority group and the racial/ethnic minority groups. 

• The individual test content areas should demonstrate some discriminant validity. That is, 
while the tests should be internally consistent and be characterized by a large dominant factor, 
when factor analyzed together, they should yield a relatively "clean" although oblique four 
factor solution. The four factors should be defined by the four content areas. The Base Year 
Psychometric Report (Rock & Pollack, 1991) presents results for the four factor solution. 
Because of the multilevel nature of two of the four tests in the tenth and twelfth grades, 
intercorrelations among the test scores rather than factor analysis results are presented in this 
report. 

• Subscores and/or proficiency scores should be provided where psychometrically justified. The
test specifications were designed to provide behaviorally-anchored proficiency (mastery)
scores in the areas of Reading, Mathematics, and Science. 

• The NELS:88 test battery should attempt to minimize Differential Item Functioning (DIF) 
across gender and racial/ethnic groups that arises from irrelevant content that favors one or 
more of the groups. 

• The NELS:88 test battery should share sufficient common items both across and within grade 
level forms, and with the HS&B battery, to provide articulation of scores for vertical equating 
in NELS:88 as well as cross-sectional equating with the 1980 HS&B sophomore cohort in 
mathematics. 

• There should be sufficient item overlap between the National Assessment of Educational 
Progress (NAEP) mathematics test and the twelfth grade NELS:88 mathematics test to cross- 
walk to the NAEP mathematics scale if desired. 

• The reading test passages should provide relatively broad content coverage and have items
that span at least three cognitive process areas. There also should be at least one passage that 
identifies in some way with minority concerns. Similarly, there should be at least one 
passage in which the main character is a female. 

• The four content areas Reading, Mathematics, Science, and History/Citizenship/Geography
must be administered (including time for administration instructions) within one hour and a 
half. 

• The tests should be sufficiently reliable to support change measurement, and be characterized 
by a sufficiently dominant underlying factor to support the Item Response Theory (IRT) 
model. This latter requirement is necessary to support the vertical equating between retestings
as well as the cross-sectional linking with HS&B and NAEP, if desired. The IRT vertical
equating puts the scores within a given content area on the same scale regardless of the grade
in which the score was obtained. This allows the user to interpret scores the same way
whether they were from the eighth, tenth, or twelfth grade. Independent of the vertical scaling,
the testing time constraints made achieving desired reliabilities problematic without
introducing some sort of adaptive testing. In order to achieve this level of reliability, as well
as reduce the possibility of "floor and ceiling" effects, the Mathematics and Reading tests
were designed to be multilevel at the tenth grade and twelfth grade. The multilevel adaptive 
approach is discussed below. 

While the NELS:88 battery provides test scores with the usual normative interpretation, it was 
also designed to have "mastery" level scores in mathematics, reading, and science. These 
multiple criterion-referenced levels serve two functions. First, they help with respect to the 
interpretation of what a score level "means" in terms of what Mary or Johnny can or cannot 
do. Second, they are useful in measuring change at particular score points along the score 
scale. In particular, when certain school processes can be expected to be reflected in score 
changes taking place at specific points along the score scale, then changes in percent or 
probability of mastery at that point in the scale would be better measures of the impact of the 
school process on student growth than would changes in the overall test score. More details 
about these criterion-referenced scores and their interpretation will be presented in the section 
on cognitive scores. 



Two Stage Multilevel Testing in a Longitudinal Framework 

The potentially large variation in student growth trajectories over a four year period argues for a 
longitudinal "tailored testing" approach to assessment. That is, in order to accurately assess a student's
status both at a given point in time as well as over time, the individual tests must be capable of measuring
across a broad range of ability/achievement. If the same test in, say, Mathematics and Reading
Comprehension were administered to the same student at the eighth, tenth, and twelfth grades, the potential 
for observing "floor effects" at grade eight and "ceiling effects" at grade twelve is greatly increased. Of 
course if all four tests were quite long and included many very difficult as well as many very easy items, 
then theoretically there would be little opportunity for floor and ceiling effects to operate. 

Unfortunately operational versions of the test must be relatively short in order to minimize the 
testing time burden on the students and their school systems. The solution to this problem was to use a 
two-stage testing procedure that allows one to at least partially tailor a test form to a particular individual's 
ability/achievement level. 

That is, a two-stage multilevel longitudinal testing procedure was implemented that used the eighth 
grade reading and mathematics test results for each student to assign him or her to a different form of the 
test when he or she was re-tested in tenth grade. The same procedure was repeated in the twelfth grade. 
For example, students scoring relatively high on the eighth grade test (top twenty-five percent) in, say,
mathematics were given a more difficult mathematics test form when they were retested as tenth graders.
Students scoring relatively low in the eighth grade (bottom twenty-five percent) received an easier form
when retested as tenth graders. Students scoring in the middle range received an "average" difficulty
mathematics form. Since tenth and twelfth grade students would be taking forms that were in a sense 
appropriate to their particular level of ability/achievement, measurement accuracy would be enhanced, and 
floor and ceiling effects would be minimized. The relative absence of ceiling effects should make the 
assessment of gain more accurate for students who had relatively high scores as eighth graders and/or as 
tenth graders. Similarly, an accurate estimate of gain for low scoring eighth graders should also be 
enhanced, since floor effects should be minimized. 
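
The routing rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming quartile cut scores obtained by a simple nearest-rank rule from the eighth grade score distribution. The function names and the sample scores are hypothetical, and the operational cut scores are not given in this section. The three-form layout applies to mathematics; reading used two forms.

```python
import math


def quartile_cuts(scores):
    """Return (25th, 75th) percentile cut scores using a nearest-rank rule."""
    ordered = sorted(scores)
    n = len(ordered)
    low_cut = ordered[max(0, math.ceil(0.25 * n) - 1)]
    high_cut = ordered[max(0, math.ceil(0.75 * n) - 1)]
    return low_cut, high_cut


def assign_form(score, low_cut, high_cut):
    """Route a student to a follow-up form from the base-year score."""
    if score > high_cut:
        return "high"    # more difficult form for the top quarter
    if score <= low_cut:
        return "low"     # easier form for the bottom quarter
    return "middle"      # "average" difficulty form for everyone else


# Hypothetical eighth grade mathematics scores for eight students.
grade8_scores = [12, 35, 7, 28, 19, 31, 22, 15]
low_cut, high_cut = quartile_cuts(grade8_scores)
forms = [assign_form(s, low_cut, high_cut) for s in grade8_scores]
```

With these sample scores, two students are routed to the easier form, two to the more difficult form, and the remaining four to the middle form.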

In summary, the tenth and twelfth grade mathematics and reading tests incorporated multilevel 
forms differing in difficulty. The tenth and twelfth grade science and history/citizenship/geography tests
were grade level adaptive in the sense that everyone took the same form within a grade but each
succeeding grade level form included additional more difficult items. 

What does the utilization of a two-stage multilevel procedure have to say about how the 
components of the NELS:88 battery should be constructed? With respect to the eighth grade, two of the 
eighth grade tests (reading and mathematics) were to serve as "branching" or "routing" tests, and thus 
ideally they should have good measurement properties throughout the test score range. That is, the test 
scores should provide reliable information at the high, the middle, and the low end of the test score 
distribution since students in these score ranges could then be routed to tests of quite different average 
difficulties in the tenth grade. 

Because of their branching role the eighth grade reading and mathematics tests were designed with 
somewhat more broad band measurement properties in mind. Operationally, the goal of maintaining good 
measurement accuracy throughout the test score range is accomplished by building tests with a relatively 
rectangular frequency distribution of item difficulties, that is, equal numbers of test items at each 
difficulty. The typical test, however, tends to follow a normal distribution of difficulties with the majority 
of the items in the middle difficulty range. However, if one wished to use the base year test as not only 
a measure of an individual's achievement status in grade 8, but also as a routing test for assignment to 
tenth grade forms that vary in difficulty, then one should have a more rectangular distribution of difficulty 
levels. 

The tenth and twelfth grade tests in reading and mathematics must include sufficient linking items 
both across grades as well as across forms within grade to allow both cross-sectional and vertical equating 
using Item Response Theory (IRT) models (Lord, 1980). In the case of the science and 
history/citizenship/geography (H/C/G) tests, linking items need to be present across grade forms only. In 
mathematics and reading the average difficulty (percent getting an item correct) of the various within-grade 
forms should be in the .45 to .60 range, and the distribution of the item difficulties (P+) should be more 
peaked than for forms that are designed to measure efficiently across a broad range of ability. The P+
values are not symmetric around .50 since in theory it is assumed that fewer students need to guess when
the items are somewhat easier. 

While the multilevel adaptive approach used in mathematics and reading and the grade level 
adaptive approach used in the science and the H/C/G tests helped in minimizing floor and ceiling effects, 
it was decided that more recent developments in IRT models would also be necessary to take full 
advantage of the adaptive nature of the NELS:88 battery. More specifically, a Bayesian procedure 
(Mislevy & Bock, 1989; Muraki & Bock, 1987) was used in estimating both the item parameters and the 
ability scores. This procedure allowed for separate prior ability distributions, thereby taking into 
consideration the differing ability distributions associated with the various forms used across and within 
grades. More details will be presented about this procedure in Chapter 3 as part of a technical discussion 
dealing with the special IRT estimation model that was used. 
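
The effect of allowing separate prior ability distributions can be illustrated with a simplified ability estimate. The sketch below is not the estimation procedure used for NELS:88, which is described in Chapter 3. It assumes a two-parameter logistic item response function, fixed item parameters, normal priors, and a posterior mean (expected a posteriori) estimate computed on a grid. All item parameters, prior means, and the response pattern are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def eap(responses, items, prior_mean, prior_sd, n_points=81):
    """Posterior mean of ability on an evenly spaced grid from -4 to 4."""
    numerator = 0.0
    denominator = 0.0
    for k in range(n_points):
        theta = -4.0 + 8.0 * k / (n_points - 1)
        # Unnormalized normal prior for the examinee's subpopulation.
        prior = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        likelihood = 1.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            likelihood *= p if u == 1 else 1.0 - p
        weight = prior * likelihood
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


# Hypothetical (discrimination, difficulty) pairs and one response pattern.
items = [(1.0, -0.5), (1.0, 0.0), (1.2, 0.5), (0.8, 1.0)]
responses = [1, 1, 0, 1]

# The same responses scored under two hypothetical form-specific priors.
eap_easy_form = eap(responses, items, prior_mean=-0.5, prior_sd=1.0)
eap_hard_form = eap(responses, items, prior_mean=0.5, prior_sd=1.0)
```

With a short test the prior carries noticeable weight, so the same response pattern yields a higher estimate under the prior centered at the higher ability level. This sketch covers ability scoring only and does not show the use of priors in item parameter estimation.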



Specifications for Individual Tests 

Based on simulations utilizing field test results (Rock & Pollack, 1987), ETS test development 
experts determined the number of test items needed to provide accurate assessment of each content area, 
and the time required to minimize speededness. Given that the maximum allowable testing time for eighth
graders was approximately one hour and thirty minutes, including five minutes for instructions, it was
decided that the time would be apportioned in the following way among the test battery components:

Reading - Twenty-one questions in twenty-one minutes.
Mathematics - Forty questions in thirty minutes. 
Science - Twenty-five questions in twenty minutes. 
History/Citizenship/Geography - Thirty questions in fourteen minutes. 
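
The arithmetic behind this apportionment can be checked directly. The figures in the sketch below are taken from the list above; the variable names are illustrative, and the sketch assumes that the five minutes of instructions are counted once for the whole battery.

```python
# (number of questions, minutes allowed) for each eighth grade test.
allotments = {
    "Reading": (21, 21),
    "Mathematics": (40, 30),
    "Science": (25, 20),
    "History/Citizenship/Geography": (30, 14),
}
instruction_minutes = 5

testing_minutes = sum(minutes for _, minutes in allotments.values())
total_minutes = testing_minutes + instruction_minutes

# Average time available per question, in seconds.
seconds_per_item = {
    name: 60 * minutes / items
    for name, (items, minutes) in allotments.items()
}
```

The four tests account for 85 minutes, which together with the instructions fills the hour and a half available. Reading allows 60 seconds per question, compared with 45 in Mathematics, 48 in Science, and 28 in History/Citizenship/Geography.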

The items that were used in the final eighth grade forms were selected from a much larger pool 
of items composed of items from NAEP, HS&B, the Second International Mathematics Study (SIMS), 
ETS test files from previous operational tests, and a pool of items specifically written for the NELS:88 
Battery. The selection of items for the pre-test item pools was based on the consensus of the members 
of subject matter committees made up of curriculum experts. 

The subject matter committees consisted of educators, teachers, and college professors specializing 
in middle school curricula. There was considerable personnel overlap with similar subject matter 
committees used in the NAEP item pool development. ETS test development specialists were in 
attendance and worked with their respective subject matter committees in developing the eighth, tenth and 
to some extent the twelfth grade assessment objectives. Once the assessment objectives were agreed upon, 
the subject matter committee members classified the items according to the objectives. A pool of 50 
Reading items, 82 Mathematics items, 42 Science items, and 60 History/Citizenship/Geography items was 
selected for pretesting. Field tests were administered to eighth, tenth and twelfth graders in the Spring 
of 1987 (Rock & Pollack, 1987). The results of the field testing were scrutinized by additional 
committees of subject matter experts who suggested numerous modifications in content, format and 
wording of the items, as well as making judgments on content coverage. Final revisions and item 
selections were made by project staff on the basis of their input, and reviewed by NCES staff. 



Matching Test Content to Curriculum 

The question of overlap between test items and curriculum content has received increasing 
attention over the last ten years and evaluation methodologies have come to be dominated by the doctrine 
of maximal overlap (Frechtling, 1989). Mehrens (1984) and Cronbach (1963), however, questioned 
whether maximal overlap is in fact desirable except possibly in those cases where a specific program is
being evaluated. Mehrens argues that a close match between curricular and test content is desirable only 
if one wishes to make inferences about specific objectives taught by a specific teacher to a specific school. 
Even if one would wish to evaluate the effects of a specific teacher in a specific class, one inference of 
importance is the degree to which the specific knowledge taught in that class generalizes to other relevant 
domains. 

Nitko (1989) argues that tests designed to measure individuals and to facilitate their learning 
within a particular instructional context are not necessarily optimum for measuring school or program
differences. Similarly Airasian & Madaus (1983) suggest that the following design variables be taken into 
account: 

(A) The ability of tests to detect differences between groups of students.

(B) The relative representativeness of the content-behavior-process sampled by test items.

(C) The parallelism of the response formats and mental processes learned during instruction with
those defined by the test tasks.

(D) The properties of the scores and the way that they will be summarized and reported.

(E) The validity of the inferences about school and program effectiveness that can be made from
the test results.

Experience and practice suggest that tests are unlikely to detect differences between schools and
programs when total test scores are used and when the subject matter tested is likely to be related to 
learning in the home (e.g., reading) rather than to schooling (e.g., mathematics) (Airasian & Madaus, 1983; 
Linn & Harnisch, 1981). 

Schmidt (1983) identifies three major types of domains from which content to be covered can be 
drawn: a priori domains, curriculum-specific or learning-material-specific domains, and instructional 
material domains. Nitko (1983) suggests that "agents" not associated with local schools or particular 
programs tend to define a priori domains by using social criteria in judging what is important for all to 
learn. He goes on to suggest that test exercises in the National Assessment of Educational Progress 
(NAEP) as well as state assessment programs are examples of assessment instruments built from a priori 
domains since they specify content to be included without necessarily linking that content to specific 
instructional material or specific instructional events. 

Cole & Nitko (1981) suggest that another design variable be considered in building tests to detect 
school and program effectiveness. They suggest that students require more time to acquire global skills 
and to grow in general educational development than to learn specific knowledges and skills. They suggest 
that tests measuring the former are less sensitive to measuring short term instructional efforts than tests 
measuring the latter. 

Cooley (1977) and Leinhardt (1980) argue for the collection of relevant classroom variables and 
developing tests that are sensitive to differences between classrooms within-program. Leinhardt & 
Seewald (1981) describe several within-school, program, and classroom variables that are important to 
program evaluators and how to measure them. Mehrens and Phillips (Mehrens, 1984; Mehrens & Phillips,
1986; Phillips & Mehrens, 1988), however, found no significant differences on standardized tests from
the use of different textbooks and different degrees of curriculum-test overlap when previous achievement
and socioeconomic status were taken into account. 

In the development of NELS:88 test items, efforts were made to take a middle road in the sense 
that our curriculum experts were instructed to select items that tapped general knowledge found in most 
curriculums but typically did not require a great deal of isolated factual knowledge. The emphasis was 
to be on understanding concepts and the measurement of problem-solving skills. However, it was thought 
necessary to assess the basic operational skills (e.g., simple arithmetic and algebraic operations) which are 
the foundations for successfully carrying out the problem-solving tasks. 

The incorporation in the mathematics test of the relatively simple arithmetic and algebraic items 
which measure procedural or factual knowledges served two purposes. First, this subset of items provided 
better assessment for those low scoring students who were just beginning to develop their "basic 
mathematical skills". Second, these items should be able to provide a limited amount of diagnostic 
information about why some students are not able to successfully carry out the tasks defined in the 
typically more demanding problem-solving items. For example, students who are not proficient on the 
problem-solving items can be further divided into two groups based on their performance on the 
arithmetical/algebraic procedural skill items. One subgroup could not very well be proficient on the 
problem-solving items since they did not demonstrate sufficient skills on the simple arithmetical/algebraic 
procedures that are a necessary but not a sufficient condition for successful performance on the problem- 
solving tasks. The remaining subgroup, however, had sufficient grounding in the basics as demonstrated 
by their successful performance on the procedural items but were unable to carry out the logical operations 
necessary to complete the solutions to the problem solving items. 
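
The two-way split described in this paragraph can be sketched as a simple classification rule. The sketch is illustrative only: the function name, the 75 percent cutoff, and the item counts are hypothetical and are not the proficiency criteria used for the NELS:88 scores.

```python
def classify(procedural_correct, procedural_total,
             problem_correct, problem_total, cutoff=0.75):
    """Classify a student from two item subsets (cutoff is hypothetical)."""
    procedural_ok = procedural_correct / procedural_total >= cutoff
    problem_ok = problem_correct / problem_total >= cutoff
    if problem_ok:
        return "proficient on problem solving"
    if procedural_ok:
        # Basics are in place, but the problem-solving items were not mastered.
        return "procedural skills only"
    # Lacks the procedural skills that problem solving depends on.
    return "lacks procedural skills"


# Three hypothetical students, each with 10 procedural and 10 problem-solving items.
groups = [
    classify(9, 10, 8, 10),
    classify(9, 10, 3, 10),
    classify(4, 10, 2, 10),
]
```

The second and third students are both non-proficient on the problem-solving items, but only the third also falls short on the procedural items.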

This hierarchical nature of the required skills is put to formal use in the development of
behaviorally anchored proficiency level scales for reading, science and mathematics. This criterion- 
referenced interpretation is discussed further in the chapter describing the estimated scores. 

This concern with respect to the maximal overlap doctrine is particularly relevant to the 
measurement of change over relatively long periods of exposure to varied educational treatments. That 
is, the two-year gaps between re-testings coupled with a very heterogeneous student population are quite 
likely to coincide with considerable variability in course taking experiences. This fact, along with the 
constraints on testing time, makes coverage of specific curriculum related knowledges very difficult. Also,
as indicated above, specificity in the knowledges being tapped by the cognitive tests could lead to 
distortions in the gain scores due to forgetting of specific details. The impact on gain scores due to 
forgetting should be minimized if the cognitive battery increasingly emphasizes general concepts and 
development of problem solving abilities. This emphasis should increase as one goes to the tenth and 
twelfth grades. Students who take more high level courses, regardless of the specific course content, are 
likely to increase their conceptual understanding as well as gain additional practice in problem-solving 
skills. 

At best any nationally based longitudinal achievement testing program must be a compromise that 
attempts to balance testing time burdens, the natural tensions between local curriculum emphasis and more 
general mastery objectives, and the psychometric constraints (in the NELS:88 case) in carrying out both 
vertical equating (year-to-year) and cross-sectional equating (form-to-form within year). NELS:88 
fortunately did have the luxury of being able to gather cross-sectional pre-test data on the item pools. 
Thus we have been able to take into consideration not only the general curriculum relevance but whether 
or not the items demonstrate reasonable growth curves, as well as meet the usual item analysis parameter 
requirements for item quality. 

The following sections contain descriptions of the content and format of each of the four 
achievement tests along with selected classical item statistics. 



Reading 

The reading test forms consisted of four or five reading passages, ranging in length from a single 
paragraph to a half-page. There are two forms of the reading test, differing in difficulty, in both the tenth 
and twelfth grade. Each passage in the reading tests (or forms) was followed by three to five multiple- 
choice questions addressing the students' ability to reproduce details of the text, translate verbal statements 
into concepts (comprehension), or draw conclusions based on the material presented (inference/evaluation). 
A total of 21 questions was presented in 21 minutes. The amount of time allowed for each question, 
which is relatively long compared to the other three content areas, takes into account the length of time 
needed for reading the passages before answering the questions. 

The reading tests typically began with the least difficult passage followed by four or five relatively 
easy questions. The content/process specifications of the pool of items that made up NELS:88 reading 
forms across all grades and forms within grade are presented in Table 2.1. The percent answering each 
item correctly (P+) and the item-total correlations (biserials) are presented by grade, and by form within 
grade for the total population in Tables 2.2 and 2.3. The IRT parameters for the reading test are presented 
in Appendix E-1. The P+ values and biserials are presented for those forms and grades for which they 
were administered. The more difficult items that differentiated the twelfth grade "high" form from the 
easier forms required comprehension of social studies material or inferences based on science material. 
Appendices A-1 to A-5 present the P+'s and biserials for gender and racial/ethnic groups also. Tables 2.2 
and 2.3 not only present the P+'s and biserials by form, but the reader can quickly identify the linking 
items for each of the forms. The linking items provide the overlap between forms that is necessary to put 
all scores on the same vertical scale, regardless of the form given. In general, we have tried to be 
conservative in the sense that we have more overlapping items than one typically finds in a vertically 
equated test battery. 
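
As an illustration of the two classical item statistics reported in Tables 2.2 and 2.3, the following Python sketch computes P+ and a biserial correlation from scored responses. It is not the program used for NELS:88. The function names `p_plus` and `biserial`, the use of a number-right total as the criterion score, and the example data are assumptions made for illustration, and the sketch ignores sampling weights.

```python
from statistics import NormalDist, mean, pstdev


def p_plus(item_scores):
    """Proportion of examinees answering the item correctly (scores are 0/1)."""
    return mean(item_scores)


def biserial(item_scores, total_scores):
    """Biserial correlation between a 0/1 item score and a total test score."""
    p = p_plus(item_scores)
    q = 1.0 - p
    if p <= 0.0 or p >= 1.0:
        raise ValueError("biserial is undefined when all examinees pass or all fail")
    passed = [t for x, t in zip(item_scores, total_scores) if x == 1]
    failed = [t for x, t in zip(item_scores, total_scores) if x == 0]
    # Ordinate of the standard normal density at the point that cuts off proportion p.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean(passed) - mean(failed)) / pstdev(total_scores) * (p * q / y)


# Hypothetical data: eight examinees, one item, number-right totals on a 21-item form.
item = [1, 1, 1, 0, 0, 1, 0, 1]
total = [18, 15, 12, 9, 7, 16, 11, 14]
print(f"P+ = {p_plus(item):.2f}, biserial = {biserial(item, total):.2f}")
```

Because the biserial rescales the point-biserial by the normal ordinate, it is always the larger of the two in absolute value and can exceed 1.0 in small or non-normal samples.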

Table 2.1 
NELS:88 Reading Specifications 
Content by Process by Test Forms* 

                                           Content Area
Process / Test Form            Literary   Science   Social Studies/Other

Reproduction of Detail
  8th Grade                        3          1
  10th Grade Low                   3          1
  10th Grade High                  2          1              1
  12th Grade Low                   3          1              1
  12th Grade High                                            1

Comprehension of Thought
  8th Grade                        1          1              1
  10th Grade Low                   1          1              1
  10th Grade High                  3          1              2
  12th Grade Low                              2              4
  12th Grade High                             1              8

Inferences and/or Evaluative Judgements
  8th Grade                       10          1              3
  10th Grade Low                  10          1              3
  10th Grade High                  9          1              1
  12th Grade Low                   6          1              3
  12th Grade High                  4          3              3

*Entries in table are the number of items 






Psychometric Report for the NELS:88 
Base Year Through Second Follow-Up 



Table 2.2 
Reading: Proportion Correct 







First Follow-up 


Second Follow-up 


item iNo. 


Base Year 


Low 


High 


Low 


High 


Item 1 


.95 


.92 




.93 




Item 2 


OC 


.80 




01 
.82 




Item 5 


.82 


.77 




on 
.8U 




Item 4 


CT 
.J / 






.J / 




Item 5 


.JJ 


.40 




CA 
.30 




Item 6 






.Oj 






item / 






cc 

JJ 






item o 






.JJ 






item y 






(if. 
.00 







item iu 






.J / 






item 1 1 






0/1 
.84 






item lz 






.OU 






item 13 






If, 






liem 14 








.25 




Item 15 


.60 


.54 


.86 


.58 




Item 16 


.41 


.33 


.67 


.36 




Item 17 


.49 


.44 


.81 


AS 




Item 18 


.61 


.54 








Item 19 


.39 


.36 


.52 


.36 


.57 


Item 20 


.59 




.76 
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Table 2.2 

Reading: Proportion Correct (cont'd) 





DtlSC 1 Cdl 


First Follow-up 


Second Follow-up 


Low 


High 


Low 


High 


Item 21 




.65 








Ttpm 99 


71 


fi9 
.OZ 


.7 I 


.03 


Q4 


Ttf»m 93 




.40 


. ly 


.JJ 


.00 


Ttpm 94 

HC1I1 Lrt 


.HO 


41 


.OZ 


47 


8Q 
.07 


Ttpm 9S 










47 


Item 26 










70 


Ttpm 97 










90 


Item 28 

X \\f III J^\J 










87 


Item 29 






<1 

.J 1 






Item 10 






61 

.UJ 






ilviii Jl 






78 
. / o 






Ttt*m X) 

iXXslll JZ. 






4^ 






Item 33 






36 






Item 34 










.59 


Item 35 










.32 


Item 36 










.50 


Item 37 










.42 


Item 38 


.46 


.38 




.48 




Item 39 


.76 


.71 




.79 




Item 40 


.54 


.40 
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Table 2.2 

Reading: Proportion Correct (cont'd) 







First Follow-up 


Second Follow-up 


Item No. 


Base Year 


Low 


High 


Low 


High 


Item 41 

l IA, 111 TI 


.54 


46 




.54 




Item 42 


.63 


J5 








Item 45 


.70 


.67 








Item 44 


.62 


.55 








Item 45 








.64 


.84 


Item 46 








.42 


.61 


Item 47 








.68 




Item 48 








.35 


.52 


Item 49 








.34 


.56 


Item 50 










.77 


Item 51 










.49 


Item 52 










.43 


Item 53 










.44 


Item 54 










.30 


Mean 


.61 


.55 


.67 


.55 


.62 


S.D. 


.14 


.15 


.15 


.18 


.20 


Unwtd 


23643 


9115 


8717 


7076 


7154 


Wtd N 


2897540 


1511539 


1368601 


1222645 


1058046 



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 
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Table 2.3 
Reading: R-Biserial 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Low 


High 


Low 


High 


Item 1 


.60 


.63 




.64 




Item 2 


.63 


.61 




.66 




Item 3 


.65 


.65 




.67 




Item 4 


.67 


.59 




.64 




Item 5 


.67 


.58 




.62 




Item 6 






.51 






Item 7 






.53 






Item 8 






.57 






Item 9 






.70 






Item 10 






.53 






Item 11 






.72 






Item 12 






.62 






Item 13 






.70 






Item 14 








.47 




Item 15 


.65 


.61 


.68 


.70 




Item 16 


.63 


.51 


.61 


.61 




Item 17 


.68 


.61 


.69 


.62 

> 




Item 18 


.57 


.45 








Item 19 


.44 


.41 


.41 


.37 


.43 


Item 20 


.64 




.59 
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Table 2.3 
Reading: R-Biserial (cont'd) 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Low 


High 


Low 


High 


Item 21 




.59 








Item 22 


.75 


.69 


.75 


.69 


.66 


Item 23 


.55 


.48 


.66 


.52 


.61 


Item 24 


.65 


.58 


.73 


.62 


.65 


Item 25 










.46 


Item 26 










.47 


Item 27 










.45 


Item 28 










.62 


Item 29 






.50 






Item 30 






.47 






Item 31 






.65 






Item 32 






.48 






Item 33 






.41 






Item 34 










.51 


Item 35 










.47 


Item 36 










.59 


Item 37 










.55 


Item 38 


.70 


.61 




.66 




Item 39 


.74 


.72 




.69 




Item 40 


.66 


.52 












Table 2.3 
Reading: R-Biserial (cont'd) 







First Follow-up 


Second Follow-up 


Item No. 


Base Year 


Low 


High 


Low 


High 


Item 41 


.53 


.47 




.50 




Item 42 


.67 


.64 








Item 43 


.64 


.58 








Item 44 


.62 


.53 








Item 45 








.53 


.66 


Item 46 








.33 


.61 


Item 47 








.59 




Item 48 








.45 


.54 


Item 49 








.39 


.60 


Item 50 










.60 


Item 51 










.47 


Item 52 










.47 


Item 53 










.44 


Item 54 










.45 


Mean 


.63 


.57 


.60 


.57 


.54 


S.D. 


.07 


.08 


.10 


.11 


.08 



Source: National Education Longitudinal Study of 1988: Second Follow-Up. U.S. Department of Education. National Center for 
Education Statistics. 



Mathematics 

Tables 2.4, 2.5 and 2.6 present the content by process specifications and the P+'s and biserials for 
the seven mathematics forms respectively. Appendices B-1 to B-7 give the P+'s and biserials for the 
gender and racial/ethnic groups. Appendix E-2 presents the IRT item parameters for the mathematics test. 
The biserials do drop below the desirable .45 - .50 range for some of the forms, primarily due to the 
restriction in range of abilities that occurs within a form. Inspection of Table 2.4 indicates that what 
distinguishes the "high" tenth and twelfth grade forms from the other forms is the increased emphasis on 






Table 2.4 
NELS:88 Math Specifications 
Content by Process by Test Forms* 



Process 


Arithmetic 


Algebra 


Geometry 


Data/Prob 


Adv 
Topic 


Skill/Knowledge 












iest rorm 


1U 


5 


1 


l 




bin Grade 


12 


4 


2 




** 


lutn Urauc Low 


n 

y 


3 




1 

1 


1 


mm uraue mco 


0 


J 




2 


2 


luui uraae nign 


1U 


A 

4 


2 






izw urauc low 


1 


L 




1 


1 


12th Grade Med 


1 


2 




1 


2 


12th Grade High 












Under/Comprehend 












Tap* CVh»«vk 

i est rorm 


0 


1 


3 


3 




oin oraae 


/ 


0 


3 


2 




luui uraae low 


0 


/L 

0 


3 


2 




1 ftok P.rnHu \/inA 

luin oraue iviea 


-3 


/ 


2 


5 


l 


luin uraue nign 


0 


J 


3 


3 




iztn uraae Low 


4 


0 


4 


2 




12th Grade Med 


l 


5 


7 


1 


3 


12th Grade High 












Problem Solving . 












Test Form 


3 








1 


8th Grade 


3 








1 


10th Grade Low 


3 


2 


2 




2 


10th Grade Med 


2 


2 


3 




2 


10th Grade High 


4 




2 




1 


12th Grade Low 


4 


3 


5 




1 


12th Grade Med 


2 


4 


9 


1 


1 


12th Grade High 













*Entries in table are the number of items 



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 
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Table 2.5 
Math: Proportion Correct 







First Follow-up 


Second Follow-up 


Item No. 


nasc xear 


Low 


IVllU 


Hi oh 

mgn 


Low 


Mid 


Hieh 


Item 1 


.56 


.42 


.67 


.92 


.52 


.76 




Item 2 










4A 
.HO 






Item j 


AQ 






.yj 


.Jo 






Item 4 










.711 






Item 5 


.5/ 


.37 


.02 


.y\J 








Item 6 


CO 

.5y 


A C 

.45 


. /5 




S8 

.JO 







Item 7 


£C 

.55 


A 1 

.4 1 






<n7 

.j / 






Item 8 


.51 


A A 

.44 


. /I 


OA 


44 






Item 9 


.0/ 


.49 


.12. 


.y5 


/18 
.HO 


78 




Item 10 


.00 


C 1 

,51 












Item 11 


.3 1 


in 
.5 1 


7n 


OA 


49 


78 

. r O 




Item 12 




ic 


AO 
.OZ 


Q3 

.73 


40 


74 




item 15 


/LI 


.J 1 


.J J 


87 


35 






Ttam 1 -1 

item m 




71 
. / 1 






.80 






item ij 


41 




40 


88 

.OO 








item io 


/LI 


.ZO 


CA 

.JO 


84 








item i / 


. JU 




CA 

.JO 


R4 








item is 


47 




.47 


.79 








Item 19 




.27 












Item 20 




.27 












Item 21 




.54 






.51 






Item 22 


.52 


.30 


.62 


.90 


.31 


.73 




Item 23 


.41 


.27 


.49 


.87 


.37 


.60 




Item 24 


.45 




.49 


.83 




.j J 


.90 


Item 25 


.37 




.41 


.73 




.46 


.82 
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Table 2.5 
Math: Proportion Correct (cont'd) 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Low 


Mid 


High 


Low 


Mid 


High 


Item 26 


.35 


.21 


.49 


.84 


.22 


.56 


.86 


Item 27 














.40 


Item 28 


.50 


.27 


.58 


.92 


.31 


.66 




Item 29 


.71 


.57 






.96 


.56 




Item 30 


.79 


.68 


.82 




.75 


.86 




Item 31 


.70 


.63 


.75 




.66 


.77 




Item 32 


.52 


.31 


.59 


.93 


.35 


.69 




Item 33 


.79 


.73 


.88 




.74 


.90 




Item 34 


.46 




.49 


.71 


.43 


.58 




Item 35 


.59 


.45 


.69 


.88 


.43 


.75 




Item 36 


.52 


.39 


.58 


.85 


.41 


.64 


.89 


Item 37 


.38 


.17 


.46 


.92 


.20 


.50 


.95 


Item 38 


.45 




.59 


.92 








Item 39 


.27 


.31 


.62 


.92 


.34 


.72 


.97 


Item 40 


.41 


.32 


.39 


.66 




.39 


.80 


Item 41 












.27 


.48 


Item 42 














.51 


Item 43 








.31 




.20 


.41 


Item 44 


.40 


.23 


.49 


.86 


.26 


.58 


.92 


Item 45 










.25 


.31 


.53 


Item 46 








.55 






.71 


Item 47 








.45 






.59 


Item 48 














.46 


Item 49 








.66 






.90 


Item 50 


.56 


.46 


.61 


.86 


.44 


.67 
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Table 2.5 
Math: Proportion Correct (cont'd) 



Item No. 




First Follow-up 


Second Follow-up 


Base Year 


Low 


Mid 


High 


Low 


Mid 


High 


Item 51 






.42 


.77 




.56 


.91 


Item 52 








.53 






.76 


Item 53 






.55 


.83 








Item 54 






.35 


.69 




.36 


.81 


Item 55 






.34 


.68 




.36 


.76 


Item 56 






.29 


.60 




.33 


.71 


Item 57 






.29 


.64 




.36 


.79 


Item 58 












.06 


.15 


Item 59 












.15 


.24 


Item 60 


.71 


.54 


.78 




.65 


.91 




Item 61 


.79 


.76 


.91 




.85 


.93 




Item 62 


.68 


.55 




.66 








Item 63 


.65 


.56 


.73 




.59 


.73 




Item 64 


.61 


.33 






.32 






Item 65 




.23 












Item 66 




.68 






.80 






Item 67 










.60 




.93 


Item 68 










.14 




.89 


Item 69 










.28 


.40 


.67 


Item 70 










.22 


.45 


.84 


Item 71 












.46 


.59 


Item 72 












.33 


.57 


Item 73 












.23 


.57 


Item 74 














.41 


Item 75 














.54 
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Table 2.5 
Math: Proportion Correct (cont'd) 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Low 


Mid 


High 


Low 


Mid 


High 


Item 76 














.41 


Item 77 














.37 


Item 78 














.16 


Item 79 














.30 


Item 80 














.23 


Item 81 














.26 


Mean 


.54 


.44 


.58 


.80 


.48 


.55 


.62 


S.D. 


.13 


.17 


.15 


.15 


.19 


.22 


.24 


Unwtd 


23648 


3199 


9780 


4814 


2554 


7717 


3965 


Wtd N 


2897116 


545728 


1635418 


689739 


429799 


1293720 


557388 



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 
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Table 2.6 
Math: R-Biserial 







First Follow-up 


Second Follow-up 


Item No. 


Base Year 


Low 


Mid 


High 


Low 


Mid 


High 


Item 1 


fin 

.DU 


A 1 


CI 

.3 1 


.JO 


4? 


54 




Ttpm ? 

1 l&vlll Xr 










.45 






Ttpm % 


.56 


.D 1 




52 


.40 






Ttpm A 
llCIll ** 




4Q 






.53 






ILCIIl J 


66 


44 


56 


.55 








TtATYl ft 

llCUl u 


68 


4Q 


61 

> Kf 1 




.48 






Ttpm 7 


.65 


.45 






.48 






Item 8 


.60 


.46 


.63 


.66 


.43 






Ttpm Q 

l Uw 111 7 


.60 


.40 


.59 


.68 


.47 


.61 




Item 10 


.55 


.38 












Item 11 


.65 


.48 


.70 


.93 


.50 


.72 




Item 12 


.65 


.41 


.62 


.75 


.50 


.65 




Item 13 


.51 


.40 


.53 


.56 


.31 






Item 14 




.51 






.46 






Item 15 

AlKslll *.iJ 


.69 




.63 


.58 






' 


Ttpm 16 

AUkslll t\J 


.66 


.43 


.61 


.54 








Ttpm 17 






.52 


.45 








Ifpm 18 

1 11 /■ U 


.27 




.26 


.37 








Item 19 




.36 












Item 20 




.37 












Item 21 




.40 






.43 






Item 22 


.70 


.49 


.61 


.60 


.44 


.55 




Item 23 


.60 


.40 


.54 


.58 


.38 


.60 




Item 24 


.45 




.45 


.52 




.54 


.50 


Item 25 


.58 




.49 


.53 




.49 


.40 
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Table 2.6 
Math: R-Biserial (cont'd) 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Low 


Mid 


Hieh 


Low 


Mid 


Hieh 


Ttpm lf\ 


S4 


.la 


.oU 


CO 

.58 


.32 


.57 


.37 


Item 27 














cc 

.J J 


Item 28 


.69 


41 


.DZ 


7fl 
. /U 


.jU 






Item 29 


.51 


41 




71 


17 






Item 30 


.50 




4fi 




93 


36 
.JO 




Item 31 


.46 




39 

.J/ 




33 


43 




Item 32 


.64 




fil 


. / \} 


44 


6? 




Item 33 


.59 


.50 


61 




3S 


44 




Item 34 


.31 




.23 


.41 


.21 


.37 




Item 35 


.57 


.40 


.47 


.41 


34 


45 




Item 36 


.54 


.40 


.46 


.52 


.37 


.48 


46 


Item 37 


.70 


.33 


.65 


.65 


.36 


.64 


.43 


Item 38 


.70 




.60 


.56 








Item 39 


.62 


.56 


.65 


.62 




71 


41 


Item 40 


.32 


.16 


.30 


.55 




37 


63 


Item 41 












20 


49 


Item 42 














.to 


Item 43 








.38 




.33 


.40 


Item 44 


.63 


.37 


.51 


.55 


.41 


.61 


.51 


Item 45 










.16 


.34 


.38 


Item 46 








.52 






.55 


Item 47 








.35 






.37 


Item 48 














.58 


Item 49 








.59 






.68 


Item 50 


.50 


.31 


.43 


.49 


.35 


.46 
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Table 2.6 
Math: R-Biserial (cont'd) 







First Follow-up 


Second Follow-up 


Item No. 


Base Year 


Low 


Mid 


High 


Low 


Mid 


High 


Item 51 






.49 


cc 




.01 


Jo 


llcul JZ 








62 






.65 


Ttf>m 
















Tfpm *vl 






35 


.67 




.49 


.57 


Ttpm 55 






.40 


.56 




.45 


.58 


Item 56 






.34 


.48 




.42 


.44 


Ttpm 57 

1 U_ 111 1 






.49 


.53 




.53 


.51 


Item 58 












.25 


.56 


Item 59 












.17 


.48 


Item 60 


.69 


.56 


.66 




.65 


.79 




Item 61 


.51 


.57 


.63 




.58 


.59 




Item 62 

A WV* 111 vi- 


.71 


49 






.50 






Item 63 


.45 


.41 


.29 




.44 


.30 




Ttpm 64 


.76 


55 






.50 






Ttem 6S 




28 












Ttf*m 66 




47 






.45 






Item 67 











.43 




.44 


Item 68 











.37 




.61 


Item 69 










.38 


.39 


.45 


Item 70 










.28 


.60 


.51 


Item 71 












.22 


.35 


Item 72 












.25 


.48 


Item 73 












.52 


.59 


Item 74 














.40 


Item 75 














.54 
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Table 2.6 
Math: R-Biserial (cont'd) 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Low 


Mid 


High 


Low 


Mid 


High 


Item 76 














.65 


Item 77 














.61 


Item 78 














.43 


Item 79 














.44 


Item 80 














.64 


Item 81 














.59 


Mean 


.58 


.42 


.52 


.57 


.41 


.48 


.51 


S.D. 


.11 


.09 


.12 


.11 


.10 


.15 


.09 



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

understanding concepts and problem solving in the areas of geometry, data/probability, and advanced 
topics. Advanced topics included pre-calculus items and/or analytic geometry items. It should be kept 
in mind that while an item may be classified as a geometry item, it more often than not requires both 
algebraic and numeric skills for a correct solution. Similarly, the algebra items almost always require 
some facility in arithmetic to arrive at the correct solution. To the extent that any discipline tends to have 
a "building block" structure, the resulting assessment must also reflect the building block nature of the 
knowledge domain. 
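
The earlier remark in this section, that biserials drop within a form because of restriction in the range of abilities, can be illustrated with a short Python sketch. It assumes the standard correction for direct selection on the total score (often called Thorndike's Case 2), run in the attenuating direction. The report does not state that this formula was applied to the NELS:88 data; the function name `restricted_r` and the numbers in the example are hypothetical.

```python
from math import sqrt


def restricted_r(full_range_r, sd_ratio):
    """Correlation expected in a range-restricted group.

    full_range_r: correlation in the unrestricted population.
    sd_ratio: SD of the selection variable within the form divided by its
              SD in the full population (below 1 when range is restricted).
    """
    r, u = full_range_r, sd_ratio
    return r * u / sqrt(1.0 - r * r + r * r * u * u)


# Hypothetical example: an item with a full-population biserial of .60 is
# given only to a subgroup whose score SD is 60 percent of the population SD.
print(round(restricted_r(0.60, 1.0), 2))  # no restriction leaves r unchanged
print(round(restricted_r(0.60, 0.6), 2))  # restriction lowers the correlation
```

Under these assumptions the same item shows a noticeably lower item-total correlation within a single difficulty-tailored form than it would in the full grade cohort, which is the pattern noted for some of the mathematics forms.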

This hierarchical knowledge domain has its advantages and disadvantages. The advantage of a 
hierarchical knowledge domain is that it typically generates a large general factor which is a prerequisite 
for the item response theory (IRT) approach to the vertical scaling necessary for measuring longitudinal 
change on the same scale. One added benefit of the hierarchical knowledge domain is that it facilitates 
the interpretation of various ascending points along the vertical scale. That is, score points along the scale 
can be assigned a meaning to the extent they reflect different proficiency levels along the knowledge 
hierarchy. In this sense knowledge hierarchies allow one to have multiple criterion-referenced points along 
the vertical scale. The primary disadvantage is that subscores based on content areas are not likely to have 
much differential validity since virtually all mathematics items incorporate knowledges from many 
different content areas. In Chapter 4 on score estimation, more details will be presented on how both 
normative scores and mastery or proficiency score estimates were obtained in reading, science, and 
mathematics. 



Science 

Table 2.7 presents the content by process item specifications for the science forms. 



Table 2.7 
NELS:88 Science Specifications 
Content by Process by Test Forms* 

Process / Test Form     Earth Sci    Chem    Sci Meth    Life Sci    Phy Sci

Skill/Knowledge
  8th Grade                 5          2                     3
  10th Grade                3          2                     2          1
  12th Grade                3          3                     3          1

Under/Comprehend
  8th Grade                 2          2         1           2
  10th Grade                2          1         1           2          1
  12th Grade                1                    3           1

Problem Solving
  8th Grade                 1          3         2           2
  10th Grade                           3         1           3          2
  12th Grade                           3         1           2          4

*Entries in table are the number of items 



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

The science tests were only grade level adaptive. That is, everyone within a grade received the same form. 
The higher grade level forms (tenth and twelfth) were modified by adding more advanced material to 
minimize ceiling effects. Tables 2.8 and 2.9 present the P+'s and biserials for the items in each grade 
level form for the total population. Appendices C-1 to C-3 show the P+'s and biserials for gender and 
racial/ethnic groups. Appendix E-3 presents the IRT parameters for the science test. 



Table 2.8 
Science: Proportion Correct 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Item 1 


.70 






1 LCI 11 £* 


7Q 






ILCIil D 


.LTT 


7? 




Item 4 


.0 / 


74 


78 


ILCIil j 


. /u 


78 


81 


ILCIil O 


. /u 


84 

.OH 


88 

• OO 


Ttf»m 7 
llvlll / 








Itf»m R 

llvlll O 








Ttf>m 0 

llvlll 7 


64 






Item 10 

1 IV111 1 \J 


.53 


.59 


.65 


Item 11 


.48 






Item 12 


.66 


.73 


.73 


Item 13 


.72 






Item 14 


.53 


.65 


.70 


Item 15 


.39 


.54 


.56 


Item 16 


.46 


.56 


.58 


Item 17 


.42 


.57 


.63 


Item 18 


.45 


.58 


.65 


Item 19 


.42 


.54 


.59 


Item 20 


.41 


.50 
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Table 2.8 

Science: Proportion Correct (cont'd) 



Item No. 


Base Year 


First Follow-up 


Second Follow-up 


Item 21 


.42 


.51 




Item 22 


.37 


.46 


.47 


Item 23 


.39 


.50 




Item 24 


.33 


.42 


.45 


Item 25 


.22 


.32 




Item 26 




.52 


.61 


Item 27 




.28 


.32 


Item 28 






.73 


Item 29 




.49 


.58 


Item 30 




.50 


.58 


Item 31 






.59 


Item 32 




.26 


.34 


Item 33 




.56 


.64 


Item 34 




.47 




Item 35 






.43 


Item 36 






.43 


Item 37 






.29 


Item 38 






.13 


Mean 


.54 


.55 


.57 


S.D. 


.15 


.14 


.17 


Unwtd 


23616 


17684 


14134 


Wtd N 


2889974 


2849102 


2262896 



Source: National Education Longitudinal Study of 1988: Second Follow-Up. U.S. Department of Education. National Center for 
Education Statistics. 

Table 2.9 
Science: R-Biserial 

Item No.     Base Year    First Follow-up    Second Follow-up
Item 1          .57
Item 2          .51
Item 3          .48            .53
Item 4          .45            .51                 .53
Item 5          .71            .71                 .70
Item 6          .67            .70                 .67
Item 7          .50
Item 8          .46
Item 9          .51
Item 10         .53            .60                 .65
Item 11         .41
Item 12         .57            .61                 .63
Item 13         .54
Item 14         .65            .71                 .73
Item 15         .47            .49                 .47
Item 16         .42            .52                 .54
Item 17         .49            .66                 .71
Item 18         .54            .61                 .61
Item 19         .50            .60                 .62
Item 20         .35            .47
Item 21         .39            .49
Item 22         .38            .46                 .46
Item 23         .27            .38
Item 24         .56            .59                 .62
Item 25         .37            .51
Item 26                        .60                 .64
Item 27                        .55                 .65
Item 28                                            .52
Item 29                        .63                 .69
Item 30                        .55                 .60
Item 31                                            .50
Item 32                        .56                 .67
Item 33                        .62                 .65
Item 34                        .44
Item 35                                            .56
Item 36                                            .33
Item 37                                            .31
Item 38                                            .26
Mean            .49            .56                 .57
S.D.            .10            .08                 .12



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

History/Citizenship/Geography 

Tables 2.10, 2.11 and 2.12 present the item content specifications, P+'s and biserials, respectively. 

Table 2.10 
NELS:88 History Specifications 
Content by Test Forms 

                 Cit/Govt    Am Hist    Geog
8th Grade           13          14        3
10th Grade           8          19        3
12th Grade          12          15        3

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 
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Table 2.11 

History/Citizen/Geography: Proportion Correct 







First Follow-UD 


Second Follow-up 


Item 1 




.83 


.89 


Item 2 


.49 


.64 


.66 


Item 3 




.63 




Item 4 


.48 


.56 




Item 5 


.55 


.68 


.71 


Item 6 


.43 


.50 


.54 


Item 7 


.77 


.83 




Item 8 


.58 


.67 


.76 


Item 9 


.42 


.52 


.59 


Item 10 


.47 


.52 


.61 


Item 11 


.45 


.44 


.57 


Item 12 






.41 


Item 13 


.48 


.53 


.65 


Item 14 


.78 


.80 




Item 15 


.66 


.72 


.80 


Item 16 


.90 


.91 




Item 17 


.30 


.85 




Item 18 


.24 


.28 


.56 


Item 19 


.84 


.91 


.96 


Item 20 






.43 


Item 21 


.35 


.44 


.59 


Item 22 


.86 






Item 23 


.84 






Item 24 


.91 






Item 25 


.88 
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Table 2.11 

History/Citizen/Geography: Proportion Correct (cont'd) 





TJocq Veal* 


Mr II SI rUllvTT'*UU 






91 






Item 27 


.76 


.80 


.91 


Item 28 






.52 


Item 29 


.66 


.74 




Item 30 


.70 


.81 




Item 31 


.54 


.67 


.78 


Item 32 




.32 


.43 


Item 33 


.47 


.60 


.72 


Item 34 


.59 


.51 




Item 35 




.71 




Item 36 






.25 


Item 37 


.52 


.56 


.68 


Item 38 




.45 




Item 39 




.42 




Item 40 






.63 


Item 41 






.70 


Item 42 






.56 


Item 43 






.64 


Item 44 






.55 


Item 45 






.29 


Item 46 






.35 


Item 47 






.20 


Mean 


.63 


.63 


.60 


S.D. 


.19 


.17 


.18 


Unwtd N 


23525 


17591 


14063 


Wtd N 


2880468 


2841095 


2253399 



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

Table 2.12 
History/Citizenship/Geography: R-Biserial 

Item No.     Base Year    First Follow-up    Second Follow-up
Item 1          .63            .66                 .67
Item 2          .53            .62                 .68
Item 3                         .40
Item 4          .57            .67
Item 5          .53            .58                 .58
Item 6          .48            .59                 .68
Item 7          .66            .72
Item 8          .59            .67                 .69
Item 9          .42            .46                 .54
Item 10         .60            .63                 .69
Item 11         .47            .49                 .61
Item 12                                            .44
Item 13         .50            .52                 .57
Item 14         .59            .62
Item 15         .61            .61                 .63
Item 16         .76            .78
Item 17         .58            .64
Item 18         .29            .46                 .69
Item 19         .64            .68                 .56
Item 20                                            .53
Item 21         .36            .59                 .71
Item 22         .61
Item 23         .49
Item 24         .78
Item 25         .67
Item 26
Item 27         .74            .77                 .74
Item 28                                            .49
Item 29         .60            .69
Item 30         .48            .58
Item 31         .55            .60                 .66
Item 32                        .52                 .55
Item 33         .48            .55                 .60
Item 34         .64            .62
Item 35                        .46
Item 36                                            .28
Item 37         .61            .65                 .68
Item 38                        .44
Item 39                        .31
Item 40                                            .60
Item 41                                            .46
Item 42                                            .60
Item 43                                            .65
Item 44                                            .50
Item 45                                            .48
Item 46                                            .42
Item 47                                            .30
Mean            .58            .59                 .58
S.D.            .11            .11                 .11



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

There was no attempt to design process specifications into the H/C/G test. Appendices D-1 to D-3 show 
the P+'s and biserials for gender and racial/ethnic groups. Appendix E-4 presents the IRT parameters for 
the H/C/G test. 

In summary, for almost all content areas the average P+'s for the grade level forms and the forms
within grade are in the targeted middle range, i.e., .45 to .65. This is a desirable range because maximal
discrimination, in the sense of differentiation between people, occurs at a P+ of .5. The one exception
is the high level mathematics form in the tenth grade, which turned out to be easier than predicted from
the field test statistics. The resulting potential for ceiling effects in the high level tenth grade
mathematics form was somewhat reduced when all three time points were pooled and Bayesian IRT procedures
were applied, which tend to "shrink" both item parameters and scores within subpopulations. This Bayesian
procedure will be discussed in more detail in the next section.
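
The statement that a P+ of .5 gives maximal differentiation between people can be checked with a short calculation. The sketch below is illustrative only and is not part of the NELS:88 procedures; the function name is hypothetical. It uses the variance of a right/wrong item, P+(1 - P+), which is proportional to the number of pairs of test takers that the item separates.

```python
def item_variance(p_plus):
    """Variance of a dichotomous (0/1) item with proportion correct p_plus."""
    return p_plus * (1.0 - p_plus)

# Evaluate the variance on a grid of P+ values from .05 to .95.
grid = [round(0.05 * k, 2) for k in range(1, 20)]
best = max(grid, key=item_variance)

print(best)  # P+ value with the largest variance on the grid
for p in (0.45, 0.50, 0.65):
    print(p, round(item_variance(p), 4))
```

Across the targeted range of .45 to .65 the item variance stays within about 10 percent of its maximum, which is consistent with treating that range as desirable.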

The biserials were largely on target, yielding for the most part quite respectable averages, i.e.,
.50 or greater for most test forms. This is a desirable target since experience suggests that tests that
achieve this average biserial level tend to approach test reliabilities in the middle eighties with as few as
20 items.

Chapter 3

IRT Scaling for Longitudinal Measurement 
and Equating to Earlier Cohorts 

In order to accurately measure the extent of cognitive gains at both the group and individual level, 
the eighth grade tests and the various forms of the tenth and twelfth grade tests must be calibrated on the 
same scale. The most convenient way of doing this is to use Item Response Theory (IRT). In order to 
successfully carry out such a calibration, the eighth, tenth, and twelfth grade items should be relatively 
unifactorial within a subject area, say mathematics or reading, with the same dominant factor underlying 
all test forms. This suggests that there should be a common set of anchor items across adjacent forms and 
that most, but not necessarily all, content areas be represented in all grade forms. Increments in difficulty 
demanded in ascending grade forms (8, 10, 12) can be accomplished by: (1) increasing the problem- 
solving demands within the same familiar content areas and (2) including content in the later forms (in 
particular twelfth grade) that tap materials normally found in the advanced course sequence but build on 
skills learned earlier in the sequence. 

As indicated earlier, Item Response Theory (IRT, see Lord, 1980) was used in calibrating the 
various forms within each content area. A brief background on IRT follows with additional information 
on the Bayesian approach taken here. 

The underlying assumption of Item Response Theory (IRT) is that a test taker's probability of 
answering an item correctly is a function of his or her ability level for the construct being measured, and 
of one or more characteristics of the test item itself. The three-parameter IRT logistic model uses the 
pattern of right, wrong, and omitted responses to the items administered in a test form, and the difficulty, 
discriminating ability, and "guess-ability" of each item, to place each test taker at a particular point, θ
(theta), on a continuous ability scale. Figure 3.1 shows a graph of the logistic function for a hypothetical
test item. The horizontal axis represents the ability scale, theta. The point on the vertical probability axis 
corresponding to the height of the curve at a given value of theta is the estimated probability that a person 
of that ability level will answer the test item correctly. The shape of the curve is given by the following 
equation describing the probability of a correct answer on item i as: 

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-1.702\, a_i (\theta - b_i)}}$$

where  $\theta$ = ability of the test taker

       $a_i$ = discrimination of item i, or how well the item distinguishes between ability levels at a
               particular point

       $b_i$ = difficulty of item i

       $c_i$ = "guessability" of item i

The "c" parameter represents the probability that a test taker with very low ability will answer the 
item correctly. In Figure 3.1, 20% of test takers with a very low level of mastery of the test material
guessed the correct answer to the question. The c parameter will not necessarily be equal to 1/(# options),
e.g., .25 for a 4-choice item. Some response options may, for unknown reasons, be more attractive than 
random guessing, while others may be less likely to be chosen.

Figure 3.1

[Graph of the logistic function for a hypothetical test item. Vertical axis: Probability of Correct Answer;
horizontal axis: Theta (Ability). Labeled values: a = 1.5, slope = .51, b = 0.0, c = .20.]

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

The IRT "b" parameters correspond to the difficulty of the items, represented by the horizontal axis
in the ability metric. In Figure 3.1, b = 0.0 means that test takers with θ = 0.0 have a probability of
getting the answer correct that is equal to halfway between the guessing parameter and 1. In this example,
60% of people at this ability level answered the question correctly. The b parameter also corresponds to the
point of inflection of the logistic function. This point occurs farther to the right for more difficult items,
and farther to the left for easier ones. Figure 3.2 is a graph of the logistic functions for seven different test
items, all with the same "a" and "c" parameters, and with difficulties ranging from b = -1.5 to b = 1.5.
For each of these hypothetical questions, 60% of test takers whose ability level matches the difficulty of
the item are likely to answer correctly. Fewer than 60% will answer correctly at values of theta (ability)
that are less than b, and more than 60% at θ > b.
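
The values quoted for the hypothetical item in Figure 3.1 can be reproduced with a few lines of code. The sketch below assumes the conventional logistic scaling constant D = 1.702, which is consistent with the slope of .51 shown in the figure; the function names are illustrative and are not taken from any NELS:88 program.

```python
import math

D = 1.702  # assumed scaling constant for the logistic model

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def slope_at_b(a, c):
    """Slope of the curve at its point of inflection, theta = b."""
    return D * a * (1.0 - c) / 4.0

# Hypothetical item from Figure 3.1: a = 1.5, b = 0.0, c = .20
a, b, c = 1.5, 0.0, 0.20
p_at_b = p_correct(b, a, b, c)   # halfway between c and 1
slope = slope_at_b(a, c)

print(round(p_at_b, 2))  # 0.6
print(round(slope, 2))   # 0.51
```

At θ = b the exponent is zero, so the probability reduces to c + (1 - c)/2, which is the "halfway between the guessing parameter and 1" described in the text.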

The discrimination parameter, "a", has perhaps the least intuitive interpretation of all. It is 
proportional to the slope of the logistic function at the point of inflection. Items with a steep slope are 
said to discriminate well. In other words, they do a good job of discriminating, or separating, people 
whose ability level is below the calibrated difficulty of the item (who are likely to get it right at only
about the guessing rate) from those of ability higher than the item "b", who are nearly certain to answer
correctly. By contrast, an item with a relatively flat slope is of little use in determining whether a person's

Figure 3.2

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for
Education Statistics.

correct placement along the continuum of ability is above or below the difficulty of the item. This idea 
is illustrated by Figure 3.3, representing the logistic functions for two test items having the same difficulty 
and guessing parameters, but different discrimination. The test item with the steeper slope (a = 2.0) 
provides useful information with respect to whether the test taker's ability level is above or below the 
difficulty level, 1.0, of the item: if the answer to this item was incorrect, the person very likely has an 
ability below 1.0; if the answer is correct, the test taker probably has a θ greater than 1.0, or guessed
successfully. A series of many such highly discriminating items, with a range of difficulty levels (b 
parameters) such as those shown in Figure 3.2, will do a good job in narrowing the choice of probable 
ability level. Conversely, the flatter curve in Figure 3.3 represents a test item with a low discrimination 
parameter (a=.3). There is little difference in proportion of correct answers for test takers several points 
apart on the range of ability. So knowing whether a person's response to such an item is correct or not 
contributes relatively little to pinpointing his or her correct location on the horizontal ability axis. 
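
The contrast between the two items of Figure 3.3 can be illustrated numerically. In the sketch below the discrimination values (a = 2.0 and a = .3) and the difficulty (b = 1.0) come from the text; the guessing parameter c = .20 and the scaling constant D = 1.702 are assumptions made only for illustration.

```python
import math

D = 1.702  # assumed scaling constant

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

b, c = 1.0, 0.20  # c is an assumed value; the text does not state it
differences = {}
for a in (2.0, 0.3):
    low = p_correct(0.0, a, b, c)    # one unit below the item difficulty
    high = p_correct(2.0, a, b, c)   # one unit above the item difficulty
    differences[a] = high - low
    print(f"a = {a}: P(theta=0) = {low:.2f}, P(theta=2) = {high:.2f}, "
          f"difference = {high - low:.2f}")
```

Under these assumptions the steep item separates test takers one unit below and one unit above its difficulty by roughly .75 in probability of a correct answer, against roughly .20 for the flat item.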

The BILOG or PARSCALE (Muraki & Bock, 1991) computer programs compute marginal maximum-likelihood
estimates of IRT parameters that best fit the responses given by the test takers. The procedure
calculates a, b, and c parameters for each test item, iterating until convergence within a specified level of
accuracy is reached. Comparison of the IRT-estimated probability with the actual proportion of correct
answers to a test item for examinees grouped by ability provides a means of evaluating the appropriateness

Figure 3.3

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for
Education Statistics.

of the model for the set of test data for which it is being used. A close match between the IRT-estimated 
curves and the actual data points means that the theoretical model accurately represents the empirical data. 

Once a pool of test items exists whose parameters have been calibrated on the same scale as the 
test takers' ability estimates, a person's probability of a correct answer for each item in the pool can be 
computed, even for items that may not have been administered to that individual. The IRT-estimated 
number correct for any subset of items is simply the sum of the probabilities of correct answers for those 
items. Consequently, the score is typically not a whole number. 
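
The IRT-estimated number correct described above is a simple sum, as the following sketch shows. The item parameters are hypothetical (a, b, c) triples chosen for illustration, not values from the NELS:88 item pool, and D = 1.702 is an assumed scaling constant.

```python
import math

D = 1.702  # assumed scaling constant

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def estimated_number_correct(theta, items):
    """Sum of the probabilities of a correct answer over a set of items."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) triples; not actual NELS:88 item parameters.
pool = [(1.0, -1.0, 0.20), (1.2, 0.0, 0.15), (0.8, 0.5, 0.25), (1.5, 1.5, 0.20)]

score = estimated_number_correct(0.0, pool)
print(round(score, 2))  # a fractional score between 0 and 4
```

Because the sum runs over any chosen subset of calibrated items, the same function gives a score on items that a particular test taker never saw.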

In addition to providing a mechanism for estimating scores on items that were not administered 
to every individual, IRT has advantages over raw number-right scoring in the treatment of guessed and 
omitted items. By using the overall pattern of right and wrong responses to estimate ability, it can 
compensate for the possibility of a low ability student guessing several hard items correctly. If answers 
on several easy items are wrong, a correct difficult item is, in effect, assumed to have been guessed. 
Omitted items are also less likely to cause distortion of scores, as long as enough items have been 
answered right and wrong to establish a clear pattern. Raw number-right scoring, in effect, treats omitted 
items as if they had been answered incorrectly. While this may be a reasonable assumption in a motivated
test, where it is in students' interest to try their best on all items, this may not always be the case in
NELS:88. 

As indicated earlier, a longitudinal growth study by its very nature consists of subpopulations 
defined by differing ability levels. That is, after all the assessments have been completed (three 
assessments in NELS:88) there are at least three recognizable subpopulations of different ability levels, 
which are tied to the time of testing. For example, the base year subpopulation will have, on average, a
lower expected level of performance than that found in each of the remaining two follow-ups. Similarly,
the average performance of the tenth graders will be lower than that of the twelfth graders. For those 
content areas in which multilevel adaptive testing was implemented, there are more than three definable 
ability level populations. In mathematics there were seven forms differing in difficulty, and thus there are 
seven ability groups which could be expected to differ in performance. In reading there were five forms, 
and thus the potential for having five subpopulations with differing levels of performance. 

In the past, when LOGIST (Wingersky, Barton & Lord, 1982) was the only reliable and 
documented three parameter computer program applicable in this area, one psychometrically acceptable 
procedure for vertical scaling in a longitudinal study would be to estimate the base year item parameters 
and fix their values at their base year quantities. When the first follow-up becomes available, item 
parameters would be estimated for only those items unique to the first follow-up. The scale is anchored 
by the items that were common to both the base year and the first follow-up, and which had their values 
fixed at their base year quantities. Variations that are improvements on this approach might include 
pooling the two waves of data and re-estimating all item parameters using all the available data and then 
using common item equating approaches such as the Stocking & Lord (1983) transformation to find 
linking constants that optimally match proportion correct on the item pool conditional on the scale (ability) 
scores. This second approach uses all the data in estimating the item parameters and thus could be 
expected to yield more stable item parameter estimates. The pooling of all time points and re-estimating 
the item parameters, of course can lead to a re-making of history in a longitudinal study where 
intermediate reports are published before all the data from all the time periods is available. That is, eighth 
grade scores that have been reported and analyzed might later be modified when the tenth and twelfth 
grade data became available. The use of all data points over time, however, is the preferable method 
because it is the one method which can provide stable estimates of both the item traces and latent trait 
scores throughout the entire ability distribution. This procedure was used in the vertical equating that was 
carried out for the High School and Beyond (Rock et al., 1985; Rock & Pollack, 1987). 

The major problem with the above LOGIST approaches is that there is no easy way to incorporate 
into the item parameters and latent trait score estimation procedure prior knowledge about what ability 
distribution an individual comes from. This shortcoming is particularly crucial in its impact on measuring 
change in longitudinal studies. A further limitation of LOGIST and other non-Bayesian approaches to IRT
is that they have no acceptable way of coping with "perfect," i.e., all-correct, scores. For example, some
very advanced individuals who took the high level mathematics form in grade ten got all the items correct. 
In conditional maximum likelihood approaches such as LOGIST, such scores are undefined or are given 
some arbitrary high value. Yet we know these individuals, while gifted, probably will not get perfect 
scores when they eventually take the high level twelfth grade form. Does this mean that they are less 
knowledgeable in grade 12 than in grade 10? Probably not. In fact almost nobody got all the items 
correct in the "hardest" form in twelfth grade. Thus if they had been given the hard items from the 
twelfth grade "high" form when they were tenth graders they would indeed have had less than perfect 
scores, and if the same set of items were repeated they would more than likely show gains.

Pooling all three time points, which amounts to pooling all the items as well as people (in a sense
pooling all available information) and recomputing all the item parameters using Bayesian priors reflecting 
the ability distributions associated with each particular test form, provides for an empirically based 
shrinkage to more reasonable item parameters and ability scores (Muraki & Bock, 1991). The fact that 
the total item pool is used in conjunction with the Bayesian priors leads to shrinking back the extreme 
item parameters as well as the perfect scores to a more reasonable quantity, which in turn allows for the 
potential of some gains even in the uppermost tail of the distribution. Each of the test forms (the eighth, 
tenth and twelfth grade forms, and in the case of reading and math, the multiple forms within year) is 
treated as a separate subpopulation with its own ability distribution. The amount of shrinkage is a function 
of the distance from the subgroup means and the relative reliability of the score being estimated. 
Theoretically this approach has much to recommend it. In practice, it has to have reasonable estimates 
of the difference in ability levels among the subpopulations in order to incorporate realistic priors. 
Essentially, the scales are determined by the linking items, and the initial prior means for the subgroups 
are in turn determined by the differential performance of the subpopulations on these linking items. For 
this reason we have designed the item pool to have an overabundance of items linking forms. This 
approach, using adaptive testing procedures combined with Bayesian procedures that allow for priors on 
both ability distributions and on the item parameters, is needed in longitudinal studies to minimize ceiling 
and floor effects. 
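
The effect of a subgroup prior on a perfect score can be sketched as follows. This is a simplified illustration and not the PARSCALE procedure: it places a prior on ability only (not on item parameters), uses equally spaced quadrature points, and relies on hypothetical item parameters, prior means, and an assumed scaling constant D = 1.702.

```python
import math

D = 1.702  # assumed scaling constant

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, items, responses):
    """Likelihood of a 0/1 response pattern at a given theta."""
    value = 1.0
    for (a, b, c), x in zip(items, responses):
        p = p_correct(theta, a, b, c)
        value *= p if x == 1 else 1.0 - p
    return value

def eap(items, responses, prior_mean, prior_sd, n_points=81, lo=-6.0, hi=6.0):
    """Posterior mean of theta under a normal prior, on equally spaced points."""
    step = (hi - lo) / (n_points - 1)
    num = den = 0.0
    for k in range(n_points):
        theta = lo + k * step
        prior = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        weight = prior * likelihood(theta, items, responses)
        num += theta * weight
        den += weight
    return num / den

# Hypothetical items and an all-correct response pattern.
items = [(1.0, b, 0.20) for b in (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]
perfect = [1] * len(items)

# The likelihood of a perfect pattern keeps rising with theta: no finite maximum.
rising = [likelihood(t, items, perfect) for t in (1.0, 2.0, 3.0, 4.0)]

# Posterior means under two assumed subgroup priors (means 0.5 and 1.5, SD 1.0).
eap_middle = eap(items, perfect, prior_mean=0.5, prior_sd=1.0)
eap_high = eap(items, perfect, prior_mean=1.5, prior_sd=1.0)
print(rising)
print(round(eap_middle, 2), round(eap_high, 2))
```

The same perfect response pattern yields a finite estimate in both cases, and the estimate is pulled toward the mean of whichever subgroup the test taker belongs to, leaving room for measured gains at a later time point.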

A multiple group version of the PARSCALE computer program (Muraki & Bock, 1991) that was 
developed for NAEP allows for both group ability priors and item priors. A publicly available multiple 
group version of the BILOG (Mislevy & Bock, 1982) computer program called BIMAIN (Muraki & Bock, 
1987, 1991) has many of the same capabilities for dichotomously scored items only. Since the 
PARSCALE program was applied to dichotomously scored items in the NELS:88 vertical scaling, its 
estimation procedure is identical to the multiple group version of BILOG or BIMAIN. PARSCALE uses 
a marginal maximum likelihood estimation approach and thus does not estimate the individual ability
scores when estimating the item parameters but assumes that the ability distribution is known for each
subgroup. Thus the posterior distribution of item parameters is proportional to the product of the
likelihood of observing the item response vector, based on the data and conditional on the item parameters
and subgroup membership, and the assumed prior ability distribution for that subgroup. More formally,
the general model in terms of item estimation is the same as that used in NAEP and described in some 
detail by Yamamoto & Mazzeo (1992; p. 158) as follows: 



$$L(\beta) = \prod_{g} \prod_{j} \int_{\theta} P(x_{jg} \mid \theta, \beta)\, f_g(\theta)\, d\theta \qquad (1)$$

In equation (1), $P(x_{jg} \mid \theta, \beta)$ is the conditional probability of observing a response vector
$x_{jg}$ of person j from group g, given proficiency $\theta$ and vector of item parameters
$\beta = (a_1, b_1, c_1, \ldots, a_n, b_n, c_n)$, and $f_g(\theta)$ is a population density for $\theta$ in group g. Prior
distributions on item parameters can be specified and used to obtain Bayes modal estimates
of these parameters (Mislevy, 1984). The proficiency densities can be assumed known and
held fixed during item parameter estimation or can be estimated concurrently with item
parameters.

The $f_g(\theta)$ in (1) are approximated by multinomial distributions over a finite number of
quadrature points, where $X_k$, for $k = 1, \ldots, q$, denotes the set of points and $A_g(X_k)$ are the
multinomial probabilities at the corresponding points that approximate $f_g(\theta)$ at $\theta = X_k$.
If the data are from a single population with an assumed normal distribution, Gauss-Hermite
quadrature procedures provide an optimal set of points and weights to best
approximate the integral in (1) for a broad class of smooth functions. For more general
$f$ or for data from multiple populations with known densities, other sets of points (e.g.,
equally spaced points) can be substituted, and the values of $A_g(X_k)$ may be chosen to be
the normalized density at point $X_k$ (i.e., $A_g(X_k) = f_g(X_k) / \sum_k f_g(X_k)$).

Maximization of $L(\beta)$ is carried out by an application of an EM algorithm (Dempster,
Laird & Rubin, 1977). When population densities are assumed known and held constant
during estimation, the algorithm proceeds as follows. In the E step, provisional estimates
of item parameters and the assumed multinomial probabilities are used to estimate expected
sample sizes at each quadrature point for each group (denoted $N_{gk}$), as well as over all
groups (denoted $N_k = \sum_g N_{gk}$). These same provisional estimates are also used to
estimate an expected frequency of correct responses at each quadrature point for each
group (denoted $r_{gik}$), and over all groups (denoted $r_{ik} = \sum_g r_{gik}$). In the M step,
improved estimates of the item parameters are obtained by treating the $N_{gk}$ and $r_{gik}$ as
known and carrying out maximum likelihood logistic regression analysis to estimate the
item parameters $\beta$, subject to any constraints associated with prior distributions specified
for $\beta$.
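
The E step described in the quoted passage can be sketched in a few lines. The code below is a simplified illustration and not the PARSCALE implementation: it assumes every person answers every item, uses equally spaced quadrature points with normalized normal densities as the group weights, and omits the M step. The group labels, item parameters, response data, and the scaling constant D = 1.702 are all hypothetical.

```python
import math

D = 1.702  # assumed scaling constant

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def normalized_weights(points, mean, sd):
    """A_g(X_k): normal density at each point, normalized to sum to 1."""
    dens = [math.exp(-0.5 * ((x - mean) / sd) ** 2) for x in points]
    total = sum(dens)
    return [d / total for d in dens]

def e_step(data, items, points, weights):
    """Expected counts N[g][k] and r[g][i][k] given provisional item parameters."""
    q = len(points)
    N = {g: [0.0] * q for g in data}
    r = {g: [[0.0] * q for _ in items] for g in data}
    for g, vectors in data.items():
        for x in vectors:
            # Unnormalized posterior weight of each quadrature point for this person.
            post = []
            for k, theta in enumerate(points):
                like = 1.0
                for (a, b, c), x_i in zip(items, x):
                    p = p_correct(theta, a, b, c)
                    like *= p if x_i == 1 else 1.0 - p
                post.append(like * weights[g][k])
            total = sum(post)
            for k in range(q):
                share = post[k] / total
                N[g][k] += share
                for i, x_i in enumerate(x):
                    if x_i == 1:
                        r[g][i][k] += share
    return N, r

# Hypothetical setup: 3 items, 2 groups, 13 equally spaced points on [-3, 3].
items = [(1.0, -0.5, 0.20), (1.2, 0.0, 0.20), (0.9, 0.8, 0.20)]
points = [-3.0 + 0.5 * k for k in range(13)]
weights = {
    "grade8": normalized_weights(points, mean=-0.5, sd=1.0),
    "grade10": normalized_weights(points, mean=0.5, sd=1.0),
}
data = {
    "grade8": [[1, 0, 0], [1, 1, 0], [0, 0, 0]],
    "grade10": [[1, 1, 1], [1, 1, 0]],
}

N, r = e_step(data, items, points, weights)

# Totals over groups: N_k and r_ik.
N_k = [sum(N[g][k] for g in N) for k in range(len(points))]
r_ik = [[sum(r[g][i][k] for g in r) for k in range(len(points))]
        for i in range(len(items))]

print([round(v, 3) for v in N_k])
```

Each person contributes a total weight of one, spread over the quadrature points, so the expected sample sizes for a group sum to the number of people in that group and the expected correct frequencies for an item sum to its observed number of correct responses.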

The user of the multiple group version of PARSCALE has the option of fixing the priors on the 
ability distribution or allowing the posterior estimate to update the previous prior and combine with the 
data-based likelihood to arrive at a new set of posterior estimates after each major EM cycle. If one 
wishes to update on each cycle, one can continue to constrain the priors to be normal or their shape can 
be allowed to vary. The NELS:88 approach was to allow for updating the prior but with the normality 
assumption. It was our experience that the "smoothing" that came from the updated normal priors led to 
less "jagged" looking ability score distributions and did not tend to overfit the item parameters. It has 
been our experience that lack of fit in the item parameter distribution would simply be absorbed in the 
shape of the ability distribution if the updated ability distribution were allowed to take any shape. A 
similar procedure was used in estimating the item parameters in the National Adult Literacy Survey (NALS)
(Kirsch et al., 1993).

Appendices E-1 to E-4 present the final item parameters for each of the content areas. The
location of each item within each test form is also given, as well as the number of possible answer choices
for each. Table 3.1 summarizes the means, standard deviations and ranges of the item parameters by
content areas.

Table 3.1

Means, Standard Deviations and Ranges of IRT Parameters

                        Number
           Parameter   of Items     Mean       S.D.       Low        High

Reading        A          54       0.9052     0.2901     0.3219     1.7607
               B          54       0.0755     1.0757    -2.5174     2.3409
               C          54       0.1494     0.1135     0.0000     0.4523

Math           A          81       0.9529     0.3119     0.4168     2.1455
               B          81       0.2987     1.4750    -2.9487     3.2030
               C          81       0.1558     0.1091     0.0000     0.4388

Science        A          38       0.8778     0.3186     0.3269     1.5459
               B          38       0.0387     1.0006    -1.9340     2.4048
               C          38       0.1850     0.1280     0.0000     0.3886

History        A          47       1.0812     0.3802     0.2955     2.0344
               B          47      -0.1899     1.2413    -2.6938     2.2582
               C          47       0.2187     0.1286     0.0000     0.5162

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for
Education Statistics.

With respect to interpreting the item parameters, "a" parameters (the discrimination parameter)
should each be over .50; "a" parameters in the neighborhood of 1.0 or above are considered very good.
As described earlier, the a parameter indicates the usefulness of the item in discriminating between points
on the ability scale. The b parameter, item difficulty, should span the range of abilities being measured.
Item difficulties should be concentrated in the range of abilities that contains most of the test takers. Test
items provide the most information when their difficulty is close to the ability level of the examinees.
Items that are too easy or too difficult for most of the test takers are of little use in discriminating between
them. Ideally the "c" parameter (the probability of a low ability person guessing correctly) should be less
than .25 for four choice items, but it may vary with difficulty and, of course, with the number of options.
Most content areas had a mixture of four choice and five choice items. The H/C/G test had some two
choice items, and thus the somewhat elevated guessing parameters. In general, the item parameters meet
these standards.

It should be remembered that the solution to equation 1 above finds those item parameters that 
maximize the likelihood across all groups (forms): seven in mathematics, five in reading, and three each 
in science and H/C/G. The present version of the multiple group PARSCALE only saves the
subpopulation means and standard deviations and not the individual expected a posteriori (EAP) scores.
The individual EAP scores, which are the means of the posterior distributions of the latent variate, were
obtained from the bgroup conditioning program, which uses the Gaussian quadrature procedure. This
variation is virtually equivalent to conditioning (e.g., see Mislevy et al., 1992) on a set of "dummy"
variables defining which ability subpopulation an individual comes from. The one difference is that the
group variances are not restricted to be equal as in the standard conditioning procedure.

In summary, equation (1) finds the item parameters that maximize the likelihood function across
all groups (forms and grades) simultaneously. The items can be put on the same vertical scale because
of the linking items that are common to either adjacent forms or some subset of forms. Using the
performance on the common items, the subgroup means can be located along the vertical scale. Since
marginal maximum likelihood estimation requires only an assumed ability density function in the
estimation of item parameters, individual ability scores are not estimated in the item parameter estimation
step; only the subgroup means and variances are estimated. The bgroup program then estimates the
individual ability scores as the mean of an individual's posterior distribution. The posterior distribution
for each individual at any given step in the bgroup iteration is the product of the likelihood of observing
that pattern of "0"s and "1"s in the item response vector, conditional on the item parameters and subgroup
membership, and the prior ability distribution. The prior ability distributions are assumed normal with the
mean and variance of their subgroup. At each succeeding step in the iterations the previous posterior
distribution becomes the new prior until the iterations converge.

Conditional independence is an assumption of all IRT models, but as Mislevy et al. (1992) point
out, it is not likely to be generally true. However, if one thinks of IRT-based scores as a summarization of
essentially the largest latent factor underlying a given item pool, then small violations are of little
significance. To ensure that there were no substantive violations of this assumption, factor analyses were
carried out on the grade 8 forms to confirm a large dominant factor underlying each content area. These
results were reported by Rock & Pollack (1987). Since students in the tenth and twelfth grade took
different forms, factor analysis was no longer appropriate. However, all item traces were inspected to
ensure a good fit throughout the ability range. More importantly, estimated proportions correct by item
by grade were also computed in order to ensure that the IRT model was reproducing the item P+'s
and that there was no particular bias in favor of any particular grade. Since the item parameters were
estimated using a model that maximizes the goodness-of-fit across the subpopulations, including grades,
one would not expect much difference here. When the differences were summed across all items for each
test, the maximum discrepancy between observed and estimated proportion correct for the whole test was
.7 of a scale score point for grade twelve mathematics, whose score scale had a range of 0 to 81. The IRT
estimates tended to slightly underestimate the observed proportions. However, no systematic bias was
found for any particular grade. Appendices F-1 to F-4 provide discrepancies by item as well as for totals
aggregated across all items.

Differential Item Functioning (DIF) 

Differential Item Functioning (DIF) as defined here attempts to identify those items showing an
unexpectedly large difference in item performance between a focal group (e.g., Black students) and a
reference group (e.g., White students) when the two groups are "blocked" or matched on their total score.
It should be noted that any such strictly internal analysis, i.e., without an external criterion, cannot detect
bias when that bias pervades all items in the test (Cole & Moss, 1989). It can only detect differences in
the relationships among items that are anomalous in some group in relation to other items. In addition,
such approaches can only identify the items where there is unexpected differential performance; they
cannot directly imply bias. A determination of bias implies not only that differential performance on the
item is related to subgroup membership, but also that the difference is unfairly associated with subgroup
membership. That is, the difference is due to an attribute not related to the construct being measured.
As Cole & Moss (1989) point out, items so identified must still be interpreted in light of the intended
meaning of the test scores before any conclusion of bias can be drawn. It is not entirely clear how the
term item bias applies to academic achievement measures given to students with different patterns of
exposure to content areas. For example, some students may take more algebra after eighth grade while
another group may take less algebra and more geometry. Both groups may have similar total scores but
for one group the algebra may be differentially difficult while the reverse is true for the other group. It
is ETS's practice to carry out DIF analysis on all tests it designs in order to detect test items with
differential performance for subgroups defined by gender and ethnicity.

The DIF program was developed at Educational Testing Service (Holland and Thayer, 1986) and
is based on the Mantel-Haenszel odds-ratio (Mantel and Haenszel, 1959) and its associated chi-square.
Basically, the Mantel-Haenszel (M-H) procedure forms odds ratios from two-way frequency tables. In
a twenty item test, 21 two-way tables and their associated odds-ratios can potentially be formed for each
item, one table for each total score from 0 to 20. The first dimension of each table is group, e.g., Whites
vs. Blacks, and the remaining dimension is passing vs. failing on a given item. Thus the question that the
M-H procedure addresses is whether or not members of the reference group, e.g., Whites, who have the
same total score as members of the focal group, e.g., Blacks, have the same likelihood of passing the item
in question. While the M-H statistic looks at passing rates for two groups while controlling for total score,
no assumption need be made about the shape of the total score distribution for either group. The chi-square
statistic associated with the M-H procedure tests whether the average odds-ratio for a test item, aggregated
across all 21 score levels, differs from unity, i.e., equal likelihood of passing.
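
The computation described above can be sketched as follows, using the standard Mantel-Haenszel formulas for the common odds ratio and the continuity-corrected chi-square. This is an illustration and not the ETS DIF program; the function name and the frequency counts are hypothetical, and only three score levels are shown for brevity.

```python
def mantel_haenszel(tables):
    """Common odds ratio and chi-square over a set of 2x2 tables.

    Each table is (ref_right, ref_wrong, foc_right, foc_wrong) for one
    total-score level.
    """
    num = den = 0.0
    sum_obs = sum_exp = sum_var = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in tables:
        total = ref_right + ref_wrong + foc_right + foc_wrong
        if total < 2:
            continue  # a level with fewer than two people carries no information
        num += ref_right * foc_wrong / total
        den += ref_wrong * foc_right / total
        n_ref, n_foc = ref_right + ref_wrong, foc_right + foc_wrong
        n_right, n_wrong = ref_right + foc_right, ref_wrong + foc_wrong
        sum_obs += ref_right
        sum_exp += n_ref * n_right / total
        sum_var += n_ref * n_foc * n_right * n_wrong / (total * total * (total - 1))
    odds_ratio = num / den
    chi_square = (abs(sum_obs - sum_exp) - 0.5) ** 2 / sum_var
    return odds_ratio, chi_square

# Hypothetical counts: equal odds of passing in both groups at every score level.
no_dif = [(10, 10, 5, 5), (30, 10, 15, 5), (18, 2, 9, 1)]
# Hypothetical counts: reference group more likely to pass at every score level.
with_dif = [(12, 8, 4, 6), (32, 8, 12, 8), (19, 1, 8, 2)]

or_equal, chi_equal = mantel_haenszel(no_dif)
or_dif, chi_dif = mantel_haenszel(with_dif)
print(round(or_equal, 2), round(chi_equal, 2))
print(round(or_dif, 2), round(chi_dif, 2))
```

An odds ratio of 1 indicates equal likelihood of passing for matched members of the two groups; values above 1 indicate that the item favors the reference group. The chi-square is referred to a distribution with one degree of freedom.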

The M-H procedure provides a statistical test of whether or not the average odds-ratio significantly
departs from unity for each item. If the probability is .05 or less, then one could say that there is
statistical evidence for DIF on the item in question. The problem with this interpretation is two-fold.
First, one is making a large number of statistical tests, one for each item, so low probabilities will be
found occasionally even if no DIF is present. Second, when both samples are relatively large, even
trivially small differences will reach statistical significance.

Given these reservations, Educational Testing Service has developed an "effect size" estimate that
is not sample size dependent. Associated with the effect sizes is a letter code that ranges from "A" to "C".
It is ETS's experience that effect sizes of 1.5 and above have practical significance. Effect sizes of this
magnitude, and which are statistically significant, are labelled with a "C". Items labelled "A" or "B" either
do not show statistically significant differential functioning for the two groups being compared, or have
differences that are too small to be important. Test development experts inspect items that are
characterized by such large DIF properties, and in some cases are able to identify the reason, other than
bias, for the differential item functioning.
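
The letter coding can be sketched as below. The report does not give the formula for the effect size; the sketch assumes it is the commonly documented ETS delta-scale statistic, MH D-DIF = -2.35 ln(odds ratio). The 1.5 cutoff for "C" comes from the text, while the 1.0 cutoff separating "A" from "B" and the chi-square critical value of 3.84 (one degree of freedom, .05 level) are assumptions made for illustration. The function names are hypothetical.

```python
import math

def mh_d_dif(odds_ratio):
    """Delta-scale effect size; negative values favor the reference group."""
    return -2.35 * math.log(odds_ratio)

def dif_category(odds_ratio, chi_square, critical=3.84):
    """Simplified A/B/C label from effect size and statistical significance."""
    size = abs(mh_d_dif(odds_ratio))
    significant = chi_square > critical
    if size >= 1.5 and significant:
        return "C"
    if size < 1.0 or not significant:
        return "A"
    return "B"

print(round(mh_d_dif(2.7), 2), dif_category(2.7, 4.06))
print(round(mh_d_dif(1.7), 2), dif_category(1.7, 10.0))
print(round(mh_d_dif(1.0), 2), dif_category(1.0, 0.05))
```

Because the label depends on the size of the odds ratio as well as on the chi-square, a large sample alone cannot push an item with a negligible difference into the "C" category.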

If DIF statistics have been obtained on pretested items, all "C" items will normally be replaced in
construction of an operational test, unless they are needed to meet test specifications. This is done
regardless of whether the group differences are related to the construct. Once a test has been administered,
however, replacement of items is no longer an option; the only choice possible is whether to accept the 
questioned item or drop it from scoring. At this stage, it has been the policy of the Educational Testing 
Service to submit items having "C" level DIF statistics to a test development committee for review. If 
the committee can identify content that is likely to be unfamiliar to the subgroup in question and which 
is irrelevant to the skill being measured the item will typically be removed from the test score. However, 
if the identified source of difference is consistent with the construct being measured, or if no reason for 
the difference can be determined, the item is retained. 

Table 3.2 presents a summary of the DIF results for the various subpopulations. The bottom of 
the table presents a summary of the number of "C" level DIF items accumulated across all content areas. 
Twenty-four items in total favored the reference groups while fifteen favored the focal groups. These two 
proportions do not differ significantly. This result, along with the fact that one might expect up to five 
percent occurrences by chance alone, suggests that there is little potential DIF in the NELS:88 battery. 
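
One way to check the statement that the two counts do not differ significantly is an exact two-sided binomial (sign) test of 24 versus 15 against an even split. The report does not say which test was used, so the choice of test and the function name below are assumptions made for illustration.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n trials when p = 0.5."""
    tail = min(k, n - k)
    # With p = 0.5 the distribution is symmetric, so double the smaller tail.
    lower_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)

favoring_reference = 24  # "C" items favoring reference groups, all waves
favoring_focal = 15      # "C" items favoring focal groups, all waves
p_value = two_sided_binomial_p(favoring_reference,
                               favoring_reference + favoring_focal)
not_significant = p_value > 0.05
```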



Speededness 

Table 3.3 presents speededness indices for the gender and racial/ethnic groups and for the total sample. The 
speededness index presented here is the percentage of students in each group who attempt the last item. 
If over 80% attempt the last item the test is assumed not to be speeded, that is, differences in test 
performance are judged not to be due to time constraints. The proportion attempting 
the last item is at best an approximate estimate of speededness and is likely to be biased in the direction of 
showing speededness when it is not present. One reason for this is that the items at the end of the test 
form tend to be the most difficult. As items near the end increase in difficulty, they may not be attempted 
by the less advanced students, and the speededness index would suggest that the test is speeded rather than 
just having items towards the end that are too difficult for some test takers. Another reason for not 
answering one or more items at the end of the test might be lack of motivation to complete a test for 
which the student will be neither rewarded nor punished. Inspection of Table 3.3 suggests that there 
is little problem with speededness. Not unexpectedly, speededness indices for the twelfth grade 
high math form fell below 80% for some subgroups. This form had five very difficult items at the very 
end. Another speededness index defines a test as not being speeded if "almost all" test takers complete 
80% of the test. This definition is not affected by clusters of hard items at the end of the test. When this 
criterion was applied, the percentages completing at least 80% of the test exceeded 95% for virtually all 
subgroups, and this finding was consistent for all grade levels. The vast majority of students who took 
the NELS:88 tests answered all of the questions. There is little indication that time constraints 
differentially affected scores for any gender or racial/ethnic subgroup. 
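
The two speededness indices described above can be sketched as follows. The data layout (one list of responses per student, with None marking an item that was not attempted) and the reading of "complete 80% of the test" as reaching at least 80% of the item positions are assumptions made for this illustration.

```python
def speededness_indices(responses):
    """Return (% attempting the last item, % reaching at least 80% of the items).

    responses: one list per student; None marks an item not attempted.
    """
    n_students = len(responses)
    reached_last = 0
    reached_80_percent = 0
    for answers in responses:
        n_items = len(answers)
        # Position (1-based) of the last attempted item, 0 if none attempted.
        last_attempted = max(
            (i + 1 for i, a in enumerate(answers) if a is not None), default=0
        )
        if last_attempted == n_items:
            reached_last += 1
        if last_attempted >= 0.8 * n_items:
            reached_80_percent += 1
    return (100.0 * reached_last / n_students,
            100.0 * reached_80_percent / n_students)
```

A subgroup would be flagged under the first index when the first value falls below 80; the second value is the one that is insensitive to a cluster of hard items at the end of a form.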
Table 3.2 
Counts of "C" Level DIF Items 

Group Favored               Reading    Math    Science    History    Total 

Base Year 
White (Reference Group)        0        0         0          1         1 
Asian (Focal Group)            0        0         0          1         1 
White (Reference Group)        0        0         0          0         0 
Hispanic (Focal Group)         0        0         0          1         1 
White (Reference Group)        0        1         1          0         2 
Black (Focal Group)            0        0         0          1         1 
Male (Reference Group)         0        1         0          1         2 
Female (Focal Group)           0        0         0          0         0 

First Follow-up 
White (Reference Group)        0        1         0          2         3 
Asian (Focal Group)            0        0         0          1         1 
White (Reference Group)        0        0         0          1         1 
Hispanic (Focal Group)         0        0         0          1         1 
White (Reference Group)        0        2         0          0         2 
Black (Focal Group)            0        2         0          0         2 
Male (Reference Group)         0        1         1          1         3 
Female (Focal Group)           0        0         0          0         0 

Second Follow-up 
White (Reference Group)        0        2         0          2         4 
Asian (Focal Group)            1        1         0          3         5 
White (Reference Group)        0        0         0          1         1 
Hispanic (Focal Group)         0        0         0          1         1 
White (Reference Group)        0        1         0          0         1 
Black (Focal Group)            1        0         0          0         1 
Male (Reference Group)         1        2         1          0         4 
Female (Focal Group)           0        1         0          0         1 

Summary          # Favoring   # Favoring    Total # C   Total Items   x4          % of C- 
                 Ref Group    Focal Group   Items       in Pool       Contrasts   DIF Items 

Base Year            5            3             8          116          464         1.7% 
1st Follow-up        9            4            13          148          592         2.0% 
2nd Follow-up       10            8            18          159          636         2.8% 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 



Table 3.3 
Percentages of Selected Subgroups 
Who Attempted the Last Item for Each Cognitive Test 

                   Total    Male    Female    Asian    Hispanic    Black    White 

Base Year 
Reading             96%     95%      96%       96%       93%        90%      97% 
Math                95%     95%      95%       96%       93%        90%      96% 
Science             97%     97%      98%       97%       96%        94%      98% 
History             98%     98%      98%       97%       97%        97%      99% 

First Follow-up 
Reading Low         94%     95%      94%       92%       89%        90%      97% 
Reading High        98%     98%      98%       97%       96%        93%      98% 
Math Low            97%     97%      98%       99%       97%        96%      98% 
Math Middle         94%     94%      94%       92%       90%        90%      96% 
Math High           97%     97%      98%       98%       94%        96%      97% 
Science             98%     98%      98%       96%       95%        96%      99% 
History             98%     98%      97%       97%       95%        95%      98% 

Second Follow-up 
Reading Low         93%     93%      93%       87%       87%        90%      95% 
Reading High        91%     91%      91%       92%       83%        75%      93% 
Math Low            98%     97%      98%       94%       96%        97%      99% 
Math Middle         91%     92%      90%       91%       87%        87%      92% 
Math High           81%     82%      79%       87%       69%        67%      82% 
Science             97%     97%      97%       98%       95%        95%      98% 
History             97%     97%      97%       95%       93%        95%      98% 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 
Motivation 

The analysis above suggests that, for those students who attempted the cognitive battery, motivation 
is not a problem. There is still a concern that those students who did not take the cognitive battery, for 
whatever reason, may not be missing at random, particularly in the twelfth grade. Tables 3.4 and 3.5 
present both the unweighted and weighted proportion of students who took cognitive tests in each content 
area, broken 



Table 3.4 

Percentage of Subgroups with Scorable Tests 
Unweighted 

Base Year              N        Reading    Math    Science    History 

Total                16,489      96.3      96.3     96.2       95.9 

Male                  8,140      96.1      96.1     96.1       95.7 
Female                8,349      96.5      96.4     96.3       96.1 

Asian                   976      96.9      96.5     96.4       96.0 
Hispanic              2,010      94.7      94.4     94.4       94.2 
Black                 1,610      95.0      95.2     94.6       94.4 
White                11,577      96.7      96.7     96.7       96.4 
American Indian         162      98.8      98.8     98.8       98.8 

Public               13,640      96.2      96.1     96.0       95.7 
Catholic              1,308      97.0      97.2     97.2       97.0 
NAIS Private          1,068      97.5      97.5     97.5       97.5 
Other Private           473      96.2      96.4     96.2       95.1 

Quartile 
SES Low               3,793      94.8      94.7     94.8       94.5 
SES Second            3,908      96.1      96.0     96.1       95.7 
SES Third             3,925      96.8      96.8     96.7       96.6 
SES High              4,862      97.2      97.2     97.0       96.7 




Table 3.4 

Percentage of Subgroups with Scorable Tests 
Unweighted (cont'd) 

First Follow-up        N        Reading    Math    Science    History 

Total                16,489      94.2      94.0     93.5       93.0 

Male                  8,140      93.9      93.7     93.2       92.7 
Female                8,349      94.4      94.2     93.7       93.2 

Asian                   995      93.9      93.4     92.7       92.1 
Hispanic              2,017      91.2      90.8     89.4       88.2 
Black                 1,628      92.0      91.5     90.8       90.0 
White                11,662      95.0      94.9     94.6       94.3 
American Indian         178      92.1      92.1     92.1       90.4 

Public               13,594      95.9      95.7     95.2       94.6 
Catholic                911      96.9      97.1     97.1       97.3 
NAIS Private            966      93.5      93.3     92.7       92.0 
Other Private           348      96.8      97.1     97.1       97.1 

Quartile 
SES Low               3,671      90.9      90.4     89.3       88.7 
SES Second            3,919      94.3      94.1     93.8       93.2 
SES Third             3,980      95.2      95.1     94.8       94.3 
SES High              4,918      95.6      95.6     95.3       94.9 

In School            15,764      96.0      95.8     95.3       94.8 
Dropout                 631      53.9      52.9     52.1       52.3 
Table 3.4 

Percentage of Subgroups with Scorable Tests 
Unweighted (cont'd) 

Second Follow-up       N        Reading    Math    Science    History 

Total                16,489      77.1      77.1     76.6       76.2 

Male                  8,140      77.2      77.2     76.7       76.2 
Female                8,349      77.1      77.0     76.5       76.2 

Asian                   995      77.3      77.4     76.9       76.3 
Hispanic              2,017      72.5      72.5     72.0       71.7 
Black                 1,628      73.1      73.1     72.1       71.6 
White                11,662      78.6      78.6     78.2       77.8 
American Indian         178      66.9      67.4     67.4       66.3 

Public               12,585      81.5      81.5     80.9       80.5 
Catholic                850      85.2      85.2     84.7       83.8 
NAIS Private            930      78.8      78.9     78.8       78.8 
Other Private           342      78.9      78.7     78.7       78.1 

Quartile 
SES Low               3,663      71.9      71.9     71.3       70.8 
SES Second            3,942      77.7      77.7     77.1       76.8 
SES Third             4,024      78.4      78.3     77.8       77.4 
SES High              4,859      79.6      79.6     79.2       78.9 

In School            14,644      81.6      81.6     81.1       80.7 
Dropout               1,116      41.8      41.3     41.0       40.9 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 
Table 3.5 

Percentage of Subgroups with Scorable Tests 
Weighted 

Base Year              Wtd N       Reading    Math    Science    History 

Total                2,970,835      96.2      96.2     95.9       95.6 

Male                 1,492,789      95.7      95.7     95.4       95.1 
Female               1,478,047      96.8      96.6     96.3       96.2 

Asian                  102,531      96.5      95.9     95.2       95.2 
Hispanic               306,232      95.0      94.6     94.5       94.3 
Black                  387,401      92.4      92.9     90.5       90.2 
White                2,105,254      97.1      96.9     97.0       96.8 
American Indian         36,415      99.3      99.3     99.3       99.3 

Public               2,613,787      96.0      95.9     95.6       95.4 
Catholic               224,755      97.5      97.7     97.7       97.5 
NAIS Private            29,741      96.4      96.4     96.4       96.4 
Other Private          102,552      98.5      98.6     98.4       98.3 

Quartile 
SES Low                726,089      95.0      94.7     94.8       94.6 
SES Second             733,914      96.1      96.2     96.2       95.8 
SES Third              744,331      97.1      97.1     96.4       96.2 
SES High               766,295      96.7      96.6     96.1       95.9 
Table 3.5 

Percentage of Subgroups with Scorable Tests 
Weighted (cont'd) 

First Follow-up        Wtd N       Reading    Math    Science    History 

Total                2,970,835      91.8      91.5     91.0       90.7 

Male                 1,492,789      91.8      91.5     91.0       90.7 
Female               1,478,047      91.8      91.6     91.0       90.8 

Asian                  105,878      91.9      91.4     90.8       90.4 
Hispanic               307,485      87.9      87.6     86.3       85.2 
Black                  390,455      86.6      85.8     84.2       84.1 
White                2,122,702      93.4      93.2     93.0       92.8 
American Indian         42,530      90.6      91.4     91.5       90.1 

Public               2,493,471      94.5      94.1     93.7       93.3 
Catholic               168,244      95.3      95.0     95.0       95.5 
NAIS Private            33,969      94.9      94.8     94.5       94.2 
Other Private           75,608      91.6      91.7     91.7       91.7 

Quartile 
SES Low                705,165      88.2      87.7     86.8       86.4 
SES Second             734,788      90.9      90.6     90.1       89.7 
SES Third              752,009      93.2      93.0     92.7       92.5 
SES High               778,667      94.5      94.5     94.2       94.0 

In School            2,767,772      94.5      94.3     93.9       93.5 
Dropout                181,535      52.7      52.0     51.0       51.3 
Table 3.5 

Percentage of Subgroups with Scorable Tests 
Weighted (cont'd) 

Second Follow-up       Wtd N       Reading    Math    Science    History 

Total                2,970,835      73.7      73.6     73.1       72.8 

Male                 1,492,789      74.4      74.4     73.8       73.5 
Female               1,478,047      72.9      72.8     72.3       72.0 

Asian                  105,878      77.5      77.5     77.1       76.4 
Hispanic               307,485      69.4      69.3     68.6       68.3 
Black                  390,455      67.6      67.6     66.9       66.8 
White                2,122,702      75.4      75.3     74.8       74.5 
American Indian         42,530      65.2      66.0     66.0       64.5 

Public               2,253,702      79.8      79.7     79.1       78.8 
Catholic               149,655      79.6      79.6     79.2       78.6 
NAIS Private            32,107      78.8      78.8     78.6       78.8 
Other Private           69,107      77.3      77.1     77.1       76.8 

Quartile 
SES Low                702,256      67.7      67.7     66.9       66.4 
SES Second             740,571      74.0      73.9     73.2       72.9 
SES Third              756,102      74.7      74.6     74.2       74.2 
SES High               771,700      77.9      77.8     77.5       77.0 

In School            2,491,861      79.9      79.8     79.3       78.9 
Dropout                301,788      42.4      42.1     41.7       41.7 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 
out by subgroup within time point. Inspection of Tables 3.4 and 3.5 indicates that there is a dropoff in 
participation rates at the second follow-up. This decline in participation rates does not appear to be 
completely random. There is some indication that the lowest SES quartile was less likely to participate 
in the second follow-up cognitive testing. This apparent bias in response rates may lead to some bias in 
the estimates of the gain between the first and second follow-up. It is suggested here that researchers 
estimate gain under differing assumptions about the causal mechanism underlying the missing scores 
to gauge the robustness of their population estimates. Such robustness checks 
are desirable here since no attempt was made to develop test score sampling weights that are 
adjusted for non-response. 
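
A very simple form of such a robustness check is sketched below. It assumes, purely for illustration, that non-respondents' average gain differs from the respondents' average by a fixed number of points, and it ignores sampling weights; the function name, the shift values, and the gains are all hypothetical.

```python
def mean_gain_under_assumption(observed_gains, n_missing, shift):
    """Overall mean gain if the missing cases' mean gain equals the
    observed mean plus `shift` (shift = 0 means missing at random)."""
    n_observed = len(observed_gains)
    observed_mean = sum(observed_gains) / n_observed
    missing_mean = observed_mean + shift
    total = n_observed * observed_mean + n_missing * missing_mean
    return total / (n_observed + n_missing)

# Hypothetical gains for four tested students and one untested student.
observed = [3.0, 4.0, 5.0, 4.0]
scenarios = {
    shift: mean_gain_under_assumption(observed, n_missing=1, shift=shift)
    for shift in (0.0, -1.0, -2.0)
}
```

If the estimate changes little across plausible shifts, the conclusion about gain is reasonably robust to the missing scores; in an actual analysis the weighted counts would replace the simple counts used here.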

Table 4.1 in the next section compares the eligible NELS population of second follow-up grade 
12 students with those who actually took the cognitive battery and also shows the comparable figures for 
the NAEP twelfth grade sample. (By definition, all NAEP participants took the NAEP tests. Students 
who were selected but for some reason not tested were deleted from the sample. However, NELS:88 
sample members who were not tested may have participated in some other part of the survey, and 
remained in the sample.) These are weighted estimates. Table 4.1 indicates that about 78% of the eligible 
seniors took the cognitive battery, while 22% did not. However, 
the subpopulation percentages of those who did participate closely mirror those of 
the second follow-up eligible population. There appears to be little evidence here that the 
missing cognitive scores make the tested in-school weighted population unrepresentative of the eligible 
in-school population. 
Chapter 4 
Normative and Proficiency Level Scores 



The cognitive test scores on the NELS:88 data files are of two broad types, normative scores and 
mastery scores. The normative scores are estimates of overall test performance and are available for all 
four cognitive areas at all three time points. Several transformations of the normative scores are included 
in the database: each of the scores is included in the original IRT-Estimated Number Right metric; each 
is transformed to a T-score metric, with standardization being done with respect to both the cross-sectional 
and longitudinal samples; finally, a quartile score ranks each test taker within the cross-sectional 
distribution of scores at each time point. 

The second broad type of scores are mastery scores, or criterion referenced proficiency scores. 
These measure mastery of certain skill levels rather than being overall measures of performance. In the 
NELS:88 test battery, mastery levels have been defined only for the reading, math and science tests. 
Dichotomous and continuous measures of mastery are included in the database. The first is an indicator 
of whether the test taker passed or failed the cluster of test items that defined each proficiency level. The 
continuous measures represent the probability of a test taker passing each level, based on overall test 
performance. 

Each of the scores in the database is discussed separately below. 



IRT Estimated Number Right 

The IRT-estimated number right for any individual at any one of the three time periods reflects 
an estimate of the number of items that a person would have answered correctly if he or she had taken 
all of the items that appeared in any form of the test. It is the probability of a correct answer on each 
item, summed over the total item pool; in mathematics, for example, the pool contains 81 items. The Bayesian Item Response Theory model 
allows one to put all the scores in, say, Mathematics on the same vertical scale so that the scores, 
regardless of the grade, can be interpreted in the same way. All the normal statistical operations that apply 
to any cognitive test score can be legitimately applied to the IRT-estimated number right. For example, 
a student's IRT-estimated number right in Mathematics in the tenth grade might be 41.3. That same 
student might have had an IRT-estimated number right of 35.3 in Math in the eighth grade and 44.5 in 
the twelfth grade. This particular student gained 6.0 points between the eighth and tenth grade (41.3 - 35.3 
= 6.0) and 3.2 points between the tenth and twelfth grade (44.5 - 41.3 = 3.2). The student's total gain over 
the four years was 9.2 points. The IRT-estimated number right in theory could range from a random 
guessing score to 81 correct in Mathematics. In fact, no one in the sample has either a random guessing 
score or a perfect score in Mathematics. The reader will notice that the IRT-estimated number right scores 
are not necessarily whole numbers, but typically include a decimal since they represent sums of 
probabilities. IRT scoring takes into consideration the pattern of correct answers and not just the simple 
number correct. In this sense IRT scoring tries to make use of all the information in the answer pattern. 
Everybody who has taken any test on any one or more of the three occasions will have at least one score 
in this metric. That is, an individual does not have to be a member of the longitudinal sample to have 
a score in this metric. 
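
The idea of a score that is a sum of probabilities can be sketched as follows. The sketch assumes a three-parameter logistic item response function with the conventional 1.7 scaling constant, which this section does not restate; the item parameters are hypothetical and the pool is cut to three items for brevity.

```python
import math

def p_correct(theta, a, b, c):
    """Assumed 3PL item response function: discrimination a, difficulty b, guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def irt_estimated_number_right(theta, items):
    """Sum of the probabilities of a correct answer over every item in the pool."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) parameters for a three-item pool.
items = [(1.0, -1.0, 0.20), (1.2, 0.0, 0.25), (0.8, 1.0, 0.20)]
score_low = irt_estimated_number_right(-10.0, items)   # near the guessing floor
score_mid = irt_estimated_number_right(0.0, items)
score_high = irt_estimated_number_right(10.0, items)   # near a perfect score
```

Under this assumed model the lower bound of the score is the sum of the guessing parameters, which is the "random guessing score" referred to above, and the upper bound is the number of items in the pool.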
IRT Theta "T" Score 

The IRT Theta "T" score has a mean of 50 and a standard deviation of 10, where the 
standardization was carried out on the weighted panel sample, i.e., on people who 
were NELS:88 core sample participants in all three waves. As in the case of the IRT-estimated number 
right, all individuals, regardless of whether they were in the panel sample or not, will have a score in this 
metric for any time point(s) at which they had a test score. The IRT-estimated number right is a 
non-linear transformation of the original theta scores. The rank ordering of individuals on this metric and 
the IRT-estimated number right metric is identical. As in the case of the IRT-estimated number right, all 
the usual statistical operations that are typically used with gain scores are appropriate. Since the IRT- 
estimated number right is tied to the total item pool and thus the metric may seem more interpretable, one 
might prefer the IRT-estimated number right metric to the "T" score Theta metric. For example, an 
individual who has an estimated IRT-estimated number right of, say 40.3, can be said to be expected to 
get about half the items correct in the total pool. Because of the non-linear transformation between the 
Theta metric and the IRT-estimated number right metric the Theta metric tends to "stretch" out the scores 
at the extreme tails. This would have little impact on virtually all the typical statistical analysis done on 
gain scores and thus any analyses using the IRT-estimated number right or the Theta metric scores will 
be similar. The choice between the two is more a matter of preference of one metric or the other with 
respect to interpretability. 
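
The standardization to a mean of 50 and SD of 10 can be sketched as below. The weighted mean and SD are computed on a reference group (here standing in for the weighted panel sample) and the same two parameters are then applied to anyone with a score. The function names, theta values, and weights are hypothetical.

```python
import math

def weighted_mean_sd(values, weights):
    """Weighted mean and (population-style) weighted standard deviation."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)

def to_t_scores(values, mean, sd):
    """Linear transformation to the T metric using fixed reference parameters."""
    return [50.0 + 10.0 * (v - mean) / sd for v in values]

# Hypothetical panel members: theta estimates and sampling weights.
panel_theta = [-1.0, 0.0, 1.0]
panel_weights = [1.0, 2.0, 1.0]
ref_mean, ref_sd = weighted_mean_sd(panel_theta, panel_weights)

panel_t = to_t_scores(panel_theta, ref_mean, ref_sd)
non_panel_t = to_t_scores([0.5], ref_mean, ref_sd)  # same parameters, outside the panel
```

Only the reference group is guaranteed to have a weighted mean of 50 and SD of 10; scores for other test takers are on the same metric but need not reproduce those values.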



Cross-Sectional Scores 

There are four additional cross-sectional scores available on the NELS:88 data files. These scores 
are called cross-sectional because they are all calibrated within each of the three separately-weighted 
sample waves. These cross-sectional scores are primarily used in statistical tables that describe score 
results within a particular grade, e.g., the twelfth grade, and use the cross-sectional weights associated with 
that wave of data. 

Each of the four content areas in each of the three waves has a t-score transformation of the IRT 
Estimated Number Right score. Unlike the Theta t-score, which is standardized with respect to all three 
waves of data combined, this transformation is based on the test scores for each year considered 
separately. All scores for core (weighted) sample members, including freshened samples in the two 
follow-up years, are used in obtaining the parameters for the transformation to a mean of 50 and SD of 
10. That is, the IRT Estimated Number Right T Score will have this weighted mean and standard 
deviation when aggregated over all core participants in a single year with the cross-sectional weight 
used in computing the statistics. Test takers who are not in the weighted core sample also have this 
score, which is computed using the same parameters as the core sample, but will not necessarily result in 
the same mean and standard deviation. 

All four content areas in each of the three grades have Achievement Quartile scores, which are 
based on a weighted frequency distribution of core sample students within each year. The IRT Number 
Right Score, IRT t-score, and Theta t-score all preserve the same rank-ordering of students within year. 
Any of these can be used to determine the score cut points that divide the weighted frequency distribution 
into four equal groups. A quartile score of "1" corresponds to the lowest group, and "4" is the highest. 
Quartile scores are also assigned to test takers who are not in the core sample by using the same cut points 
as for the core students. The appropriate interpretation of a quartile score of "2" for an augmented-sample 
student in the second follow-up, for example, would be: "This student has a score that would put him or 
her in the second quartile of twelfth graders nationwide in that year." Again, quartile scores for additional 
test takers will not necessarily divide the other samples into four equal groups since the distribution of 
scores may not match that of the nationally representative weighted core sample. 
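
The quartile assignment can be sketched as follows: cut points are taken from the weighted distribution of core sample scores and then reused unchanged for test takers outside the core sample. The handling of ties at a cut point and the function names are assumptions made for this illustration.

```python
def weighted_quartile_cuts(scores, weights):
    """Scores at which the cumulative weight first reaches 25%, 50% and 75%."""
    pairs = sorted(zip(scores, weights))
    total = sum(weights)
    targets = [0.25 * total, 0.50 * total, 0.75 * total]
    cuts = []
    cumulative = 0.0
    next_target = 0
    for score, weight in pairs:
        cumulative += weight
        while next_target < 3 and cumulative >= targets[next_target]:
            cuts.append(score)
            next_target += 1
    return cuts

def quartile(score, cuts):
    """1 = lowest group, 4 = highest; a score equal to a cut stays in the lower group."""
    return 1 + sum(score > cut for cut in cuts)

# Hypothetical core sample: eight students with equal weights.
core_scores = [1, 2, 3, 4, 5, 6, 7, 8]
core_weights = [1, 1, 1, 1, 1, 1, 1, 1]
cuts = weighted_quartile_cuts(core_scores, core_weights)
non_core_quartile = quartile(6.5, cuts)  # same cut points, student outside the core sample
```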

As with the scores described above, the parameters for standardizing the composite and the cut points for the 
composite quartiles were applied to the non-core samples to produce scores that allow these samples to be 
compared to national population estimates. 



Criterion-Referenced Proficiency Scores 

In addition to the normative interpretations in the NELS cognitive tests, the reading, mathematics 
and science tests also provide criterion-referenced interpretations. The criterion-referenced interpretations 
are based on students demonstrating proficiencies on clusters of items that mark ascending points on the 
test score scale. For example, there are three separate clusters of items in reading that mark the low, 
middle, and high end of the reading scale. The items that make up these clusters exemplify the skills 
required to successfully answer the typical item located at these points along the scale. 

General Description of the Proficiency levels 

The three levels of proficiency in the reading test, five in the mathematics test, and three in the 
science test, are as follows: 

Reading 

Reading Level 1: Simple reading comprehension including reproduction of detail and/or the author's 
main thought. 

Reading Level 2: Ability to make relatively simple inferences beyond the author's main thought 
and/or understand and evaluate relatively abstract concepts. 

Reading Level 3: Ability to make complex inferences or evaluative judgments that require piecing 
together multiple sources of information from the passage. 



Mathematics 

Math Level 1: Simple arithmetical operations on whole numbers: essentially single step operations 
which rely on rote memory. 






Math Level 2: Simple operations with decimals, fractions, powers and roots. 

Math Level 3: Simple problem solving, requiring the understanding of low level mathematical 
concepts. 

Math Level 4: Understanding of intermediate level mathematical concepts and/or having the ability 
to formulate multi-step solutions to word problems. 

Math Level 5: Proficiency in solving complex multi-step word problems and/or the ability to 
demonstrate knowledge of mathematics material found in advanced mathematics 
courses. 



Science 

Science Level 1: Understanding of everyday science concepts; "common knowledge" that can be 
acquired in everyday life. 

Science Level 2: Understanding of fundamental science concepts upon which more complex science 
knowledge can be built. 

Science Level 3: Understanding of relatively complex scientific concepts; typically requiring an 
additional problem solving step. 

There are two kinds of criterion-referenced proficiency scores. The first kind is a dichotomous 
score of 0 or 1, where a "1" indicates mastery of the material at this objective level and a "0" implies 
non-mastery. The second kind is a continuous score indicating the probability that a student has mastered 
the type of items that describe a particular criterion-referenced level. The proficiency levels are 
hierarchically ordered in the sense that mastery of the highest level among three levels implies that one 
would have also mastered the lower two levels. A student who has mastered all three hierarchical levels 
would have a dichotomous score pattern for the three levels of [1 1 1]. Similarly, a student who only 
mastered the first two levels would have a dichotomous score pattern of [1 1 0]. A "reversal" pattern such 
as [0 1 1], that is, a failed easy level followed by one or more passed more difficult levels, is inconsistent 
with the hierarchical model. Students who omitted items that were critical to determining proficiency 
level, or who have reversals in proficiency score patterns, will have a "blank" instead of a "0" or "1". 
Students who took enough of the items marking the proficiency levels and who had no reversals will have 
0 or 1 scores for each of the proficiency levels that were available for that grade and content area. 
The vast majority of students did fit the hierarchical proficiency model, i.e., had no reversals. 
Dichotomous proficiency scores are present for reading, mathematics, and science. The twelfth grade 
typically had more dichotomously scored proficiency levels than the lower grades since it always incorporated 
all the lower levels plus any new more difficult level(s). Also, the most difficult mathematics form did 
not include the easiest proficiency level and the easiest form did not include the most difficult proficiency 
level. There were four items that served as markers for each proficiency level. A student was defined 
to be proficient at a given proficiency level if he or she got any 3 of 4 items correct that "mark" that level. 
Items were selected for a proficiency level if they shared similar cognitive processing requirements and 
this cognitive demand similarity was reflected in similar item difficulties. 
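
The scoring rules just described can be sketched as follows. The 3-of-4 rule and the treatment of reversals come from the text; the exact rule for when omitted items leave a level undetermined is not given, so the version below (a level is undetermined when neither three rights nor two wrongs have been observed) is an assumption, as are the function names.

```python
def level_score(marker_responses):
    """Score one level from its four marker items.

    Each response is True (right), False (wrong) or None (omitted).
    Returns 1 (mastery), 0 (non-mastery) or None (undetermined).
    """
    n_right = sum(1 for r in marker_responses if r is True)
    n_wrong = sum(1 for r in marker_responses if r is False)
    if n_right >= 3:
        return 1
    if n_wrong >= 2:
        return 0      # three correct answers are no longer possible
    return None       # assumed rule: omissions leave the level undetermined

def proficiency_pattern(levels):
    """Return the list of 0/1 scores, or None (a "blank") for omits or reversals."""
    scores = [level_score(level) for level in levels]
    if any(s is None for s in scores):
        return None
    # A reversal is a failed level followed by a passed, more difficult level.
    for easier, harder in zip(scores, scores[1:]):
        if easier == 0 and harder == 1:
            return None
    return scores
```

For example, a student who passes the first two reading clusters and fails the third receives the pattern [1, 1, 0], while a student who fails the first cluster but passes the harder ones receives a blank.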

Analyses using the dichotomous proficiency scores include descriptive statistics that show the 
percentages of various subpopulations who have demonstrated proficiencies at each of the hierarchical 
levels. They can also be used to examine patterns of change with respect to proficiency levels. An 
example of this type of analysis using dichotomous proficiency scores can be found in Rock, Owings & 
Lee (1994). 

The second kind of proficiency score is the probability of being proficient at each of the levels. 
This is a continuous analog to the dichotomous proficiency scores. The advantages of the probability of 
being proficient at each of the levels over the dichotomous proficiencies are that (1) they are continuous 
scores, and thus the more powerful statistical methods can be applied, and (2) probabilities of being 
proficient at each of the levels, say in grade 10, are available for any individual who had a test score in 
grade 10. This second advantage holds because the IRT model enables us to estimate how a person would 
do on even those items that he or she was not given, e.g., if they were on a different form or not given 
in that grade. By contrast, the item-based dichotomous scores depend heavily on students answering the 
actual items in the cluster. 
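
One plausible way to obtain such a probability, sketched here under stated assumptions, is to use the IRT model to get the probability of a correct answer on each of the four marker items at a student's estimated ability, and then compute the chance of getting at least three of the four correct. The report does not spell out the exact computation, so the three-parameter logistic form, the independence of items given ability, the function names, and the item parameters are all assumptions for illustration.

```python
import math
from itertools import product

def p_correct(theta, a, b, c):
    """Assumed 3PL item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def prob_at_least(k, probs):
    """Probability of at least k correct among independent items."""
    total = 0.0
    for outcome in product((0, 1), repeat=len(probs)):
        if sum(outcome) >= k:
            p = 1.0
            for hit, pr in zip(outcome, probs):
                p *= pr if hit else (1.0 - pr)
            total += p
    return total

def prob_proficient(theta, marker_items, needed=3):
    """Probability of passing a level defined by its marker items."""
    probs = [p_correct(theta, a, b, c) for a, b, c in marker_items]
    return prob_at_least(needed, probs)

# Hypothetical (a, b, c) parameters for the four items marking one level.
level_items = [(1.0, 0.5, 0.2), (1.1, 0.6, 0.2), (0.9, 0.4, 0.25), (1.2, 0.7, 0.2)]
p_low = prob_proficient(-1.0, level_items)
p_high = prob_proficient(2.0, level_items)
```

Because the probability depends only on the ability estimate and the item parameters, it can be computed for a student who never saw the marker items, which is the second advantage noted above.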

The proficiency probabilities are particularly appropriate for relating specific processes to changes 
that occur at different points along the score scale. Since the proficiency levels are hierarchical they mark 
different ascending points along the score scale. For example, one might wish to evaluate the impact of 
taking advanced math courses on changes in mathematics from grade 10 to grade 12. One approach to 
doing this would be to subtract every student's tenth grade IRT-estimated number right from their 
twelfth grade IRT-estimated number right and correlate this difference with the number of advanced 
mathematics courses taken between the tenth and twelfth grade. The resulting correlation will be relatively 
small because individuals taking no advanced mathematics courses are also gaining but probably at the 
low end of the test score scale. Individuals who are taking advanced mathematics courses are also gaining 
but at the higher end of the test score scale. To be more concrete, let us say that the individuals who took 
none of the advanced math courses gained on average 3 points, all at the low end of the test score scale. 
Conversely the individuals who took the advanced math courses gained 4.5 points but virtually all these 
individuals made their gains at the upper end of the test score scale. When the researcher correlates 
courses with gains, the fact that on average the advanced math takers gained only slightly more than those 
taking no advanced mathematics courses will lead to a very small correlation between gain and process 
(advanced math course taking). This low correlation has nothing to do with reliability of gain scores, but 
it has much to do with where on the test score scale the gains are taking place. Gains in the upper end 
of the test score distribution reflect increases in knowledge in advanced mathematical concepts and 
processes while gains at the lower end reflect gains in basic arithmetical concepts. In order to relate 
specific processes to gains successfully one has to match the process of interest to where the gain is taking 
place. 

The proficiency probabilities do this since they mark ascending places on the test score 
distribution. If one wishes to relate the number of advanced math courses taken to changes, one should be 
looking at changes at the upper end of the test score distribution. How does one use the proficiency 
probabilities to do this? There are five proficiency levels in mathematics, with level 4 and level 5 marking 
the two highest points along the test score scale. One would expect the taking of advanced math courses 
to have its greatest effects on changes in probabilities of being proficient at these highest two levels. Thus 
one would simply subtract each individual's tenth grade probability of being proficient at, say, level 4 from 
the corresponding probability of being proficient at level 4 in twelfth grade. Now every individual has 
a continuous measure of change in mastery of advanced skills rather than along the whole score scale. 
One then correlates this change in level 4 probabilities with the number of advanced mathematics courses 
taken and should observe a substantial increase in the relationship between change and process (number 
of advanced mathematics courses taken). One might wish to do the same thing with the level 5 
probabilities as well. The main point here is that certain school processes, in particular course taking 
patterns, target gains at different points along the test score distribution. One has to match the type of 
school process one is evaluating with the location on the test score scale where the gains are likely to be 
taking place and then select the proper proficiency levels for appropriately evaluating that impact. (For 
an example of the use of probability of proficiency scores to measure mathematics achievement gain in 
relation to program placement and course taking, see Chapter 4 of Scott, Rock, Pollack & Ingels, 1995). 
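
The procedure just described (difference the level 4 probabilities, then correlate the difference with course taking) can be sketched in a few lines of Python. This is only an illustration: the function names and the five-student data set are hypothetical and are not drawn from the NELS:88 files.

```python
import numpy as np


def level_gain(p_grade10, p_grade12):
    """Gain in the probability of proficiency at one level (grade 12 minus grade 10)."""
    return np.asarray(p_grade12, dtype=float) - np.asarray(p_grade10, dtype=float)


def pearson_r(x, y):
    """Unweighted Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Hypothetical level 4 proficiency probabilities for five students.
p4_grade10 = [0.05, 0.10, 0.20, 0.30, 0.40]
p4_grade12 = [0.06, 0.12, 0.45, 0.70, 0.85]
# Hypothetical number of advanced mathematics courses taken in grades 11-12.
advanced_courses = [0, 0, 1, 2, 3]

gain_level4 = level_gain(p4_grade10, p4_grade12)
r_gain_courses = pearson_r(gain_level4, advanced_courses)
print(gain_level4, round(r_gain_courses, 3))
```

An analysis of the actual data file would also need to apply the sampling weights, which this sketch ignores.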



NAEP Equated Score 

The goals set out for the NELS:88 test battery in the base year included generation of mathematics 
cross-walks with two other studies. The NELS:88 tests were to share sufficient common items with the 
HS&B battery to support cross-sectional equating with the 1980 HS&B sophomore cohort in mathematics 
(for an example of such HS&B/NELS:88 equating, see Rasinski, Ingels, Rock & Pollack, 1993). The 
NELS:88 tests were also to provide sufficient item overlap with the National Assessment of Educational 
Progress (NAEP) mathematics test at twelfth grade to cross-walk to the NAEP mathematics scale. 

Hence a score on the NAEP scale in mathematics has been placed on the NELS:88 1992 data file 
for every student who had a twelfth grade NELS mathematics score. This is an equated score based on 
an equipercentile equating procedure. The validity of the equating procedure relies on the fact that both 
the NAEP and NELS samples are probability samples from the same parent population. In addition, the 
equating assumes that the two tests provide a reasonable match in content. Table 4.1 contains the 
subpopulation makeup of the two samples. 



Table 4.1 

A Comparison of the NAEP and NELS 12th Grade Samples 

Estimated proportion of selected subpopulation based on weighted percentages 

                                          NELS          NELS 
                              NAEP     Population    Test Takers 
Total Population Estimate   2,522,170   2,537,024     1,979,737 

Male                           48.8%       50.4%         50.9% 
Female                         51.2%       49.6%         49.1% 

White                          71.1%       72.3%         73.3% 
Black                          14.7%       11.9%         11.4% 
Hispanic                        9.5%       10.0%          9.8% 

Public                         87.1%       89.9%         90.1% 
Private                         4.5%        4.3%          3.9% 
Catholic                        8.4%        5.8%          5.9% 

Source: National Education Longitudinal Study of 1988: Second Follow-Up and National Assessment of Educational Progress 1992 
Twelfth Grade Sample, U.S. Department of Education, National Center for Education Statistics. 

Empirical checks on the validity of the equating procedure included comparing subgroup differences on 
the equated score with those found on the original NAEP scale. Virtually all checks were within one 
standard error. A researcher who wishes to look at the relationship between the background and process 
variables from the NELS data base using the NAEP mathematics scale score can now do so. 
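
For readers unfamiliar with equipercentile equating, the following Python sketch shows the basic idea: a score on one test is mapped to the score on the other test that has the same percentile rank. It is a simplified illustration, not the procedure used to produce the score on the data file. It ignores sampling weights and any smoothing of the score distributions, and the function names and score values are hypothetical.

```python
import numpy as np


def percentile_rank(scores, x):
    """Mid-percentile rank (0-100) of score x within an array of scores."""
    scores = np.asarray(scores, dtype=float)
    below = np.sum(scores < x)
    equal = np.sum(scores == x)
    return 100.0 * (below + 0.5 * equal) / scores.size


def equipercentile_equate(x, from_scores, to_scores):
    """Map x on the 'from' scale to the 'to' scale score with the same percentile rank."""
    rank = percentile_rank(from_scores, x)
    return float(np.percentile(np.asarray(to_scores, dtype=float), rank))


# Hypothetical score distributions on the two scales (not NELS:88 or NAEP data).
nels_like = [10, 20, 30, 40, 50]
naep_like = [200, 250, 300, 350, 400]

equated_mid = equipercentile_equate(30, nels_like, naep_like)
equated_low = equipercentile_equate(10, nels_like, naep_like)
equated_high = equipercentile_equate(50, nels_like, naep_like)
print(equated_low, equated_mid, equated_high)
```

Because the mapping depends only on percentile ranks, it preserves the ordering of students while placing their scores on the target scale.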

Chapter 5 

Psychometric Properties of the NELS:88 Scores 

In the final analysis the reliability and validity of the NELS:88 cognitive scores depend on the: 
1) appropriateness of the test content specifications, 2) psychometric quality of the test items themselves, 
3) appropriateness of the difficulty of the tests for the students being measured, 4) lack of speededness, 
5) success of the IRT procedures used for linking across grades and forms, and 6) scoring procedures. 
Previous sections discussed content specifications, psychometric qualities of the items, appropriateness of 
item difficulties, speededness and linking procedures used. This chapter provides both traditional indices 
of reliability and IRT-centered estimates. In addition, evidence for the construct and predictive 
validity of the NELS:88 scores is presented. 



Reliability of the IRT Scores 

An approximate index of the reliability of the IRT theta estimates is presented in Table 5.1 by 
grade and content area. While the plot of the information function is the most comprehensive measure 
of the reliability of the IRT scores, it is sometimes helpful to present an estimate of the more familiar 
single index type. These indices are computed as 1 minus the ratio of the average measurement error 
variance to the total variance (see for example, Samejima, 1994). 



Table 5.1 
Reliability of Theta 

                                   Base       First       Second 
                                   Year     Follow-up    Follow-up 
Reading                            .80        .86          .85 
Math                               .89        .93          .94 
Science                            .73        .81          .82 
History/Citizenship/Geography      .84        .85          .85 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 



$$ r_j = 1 - \frac{\bar{\sigma}_j^2}{\sigma^2(\theta)} $$ 

where: 

$\bar{\sigma}_j^2$ = average posterior variance for the j'th subtest 
$\sigma^2(\theta)$ = variance of the thetas 
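
The index described above (1 minus the ratio of the average measurement error variance to the total variance of the theta estimates) can be computed as in the following Python sketch. The function name and the three-examinee data are hypothetical, and the sketch takes the unweighted variance of the theta estimates as the total variance, which is an assumption.

```python
import numpy as np


def theta_reliability(theta_hat, posterior_var):
    """1 - (mean posterior variance) / (variance of the theta estimates)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    posterior_var = np.asarray(posterior_var, dtype=float)
    return float(1.0 - posterior_var.mean() / theta_hat.var())


# Hypothetical theta estimates and posterior variances for three examinees.
theta_hat = [-1.0, 0.0, 1.0]
posterior_var = [0.1, 0.1, 0.1]

reliability = theta_reliability(theta_hat, posterior_var)
print(round(reliability, 2))
```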

Inspection of Table 5.1 indicates that the introduction of the adaptive forms in grade 10 and 12 reading 
and math led to substantial increases in reliability. It should be noted that the base year psychometric 
report (Rock & Pollack, 1991) reported coefficient alpha reliabilities based on the observed scores. 
Because of the adaptive nature of the reading and mathematics tests at the first and second follow-ups, the same 
reliability estimation procedure was no longer appropriate. This report, in order to be consistent across 
all subject areas and time points, used the IRT reliability estimation procedure for all measures whether 
they were adaptive or not. The information functions are presented in Appendix G. The test information 
function shows the amount of information available in the items for estimating 
the ability scores at each point in the ability distribution. More specifically, the test information function 
estimates the reciprocal of the squared standard error of measurement at each ability level. The greater 
the amount of information at a given ability level, the more closely the estimates of ability cluster around 
the true ability level (Baker, 1992). That is, the greater the height of the test information function the 
more precise the estimates. The fact that the height of the curve is much reduced as one moves towards 
the tails indicates that maximum information occurs in the middle of the range, where the 
item difficulty approximates the abilities of the majority of the test takers. This latter property is precisely 
why the NELS:88 battery developed adaptive test forms in mathematics and reading. 
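
The relationship between item parameters, test information, and the standard error of measurement can be illustrated with a short Python sketch. It assumes that the A, B and C columns of Appendix E-2 are the discrimination, difficulty and lower-asymptote parameters of a three-parameter logistic model, and that the scaling constant is D = 1.7; both are assumptions, not statements taken from this chapter. The three items are mathematics items 8, 9 and 11 from Appendix E-2, chosen only as an example, and the function names are hypothetical.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information under the three-parameter logistic model."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def total_information(theta, items):
    """Test information: the sum of the item informations at theta."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def standard_error(theta, items):
    """Standard error of measurement: reciprocal square root of the information."""
    return 1.0 / np.sqrt(total_information(theta, items))


# (A, B, C) values for three mathematics items listed in Appendix E-2.
items = [
    (1.01292, -0.47088, 0.24387),
    (1.12383, -0.46246, 0.35119),
    (1.29364, -0.53688, 0.21087),
]

for theta in (-4.0, -0.5, 4.0):
    print(theta, total_information(theta, items), standard_error(theta, items))
```

With these three items, whose difficulties all lie near -0.5, the information is highest near the middle of the scale and falls off sharply in both tails, which is the pattern the adaptive forms were designed to counteract.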



Construct Validity of the NELS:88 Content Areas 

Table 5.2 presents the intercorrelations of the content areas by year of administration. There is 
some tendency for the intercorrelations among content areas to increase with grade in school. That is, the 
average intercorrelations among content areas are .72, .75, and .76 for the eighth, tenth, and twelfth grades 
respectively. Correlations between adjacent administrations within the same content areas tend to be 
higher than those found between content areas within the same administration. This finding is consistent 
with the notion that the content areas should show some discriminant validity. Additional information on 
the discriminant validity for the content areas can be found in Rock & Pollack (1991). Also, correlations 
between eighth and tenth grade scores tend to be lower than those found between tenth and twelfth grade 
scores within all the content areas. This is consistent with the fact that proportionately greater changes 
in achievement measured by these tests occurred between the eighth and tenth grade than occurred 
between the tenth and twelfth grade. 

While the internal correlational analyses among the scale scores show some discriminant and 
convergent validity for the content areas, they tell us little about how well the application of Bayesian IRT 
approaches "worked" compared to the more traditional baseline technique based on the LOGIST 
conditional maximum likelihood estimation. The following discussion presents some results comparing 
two variations of the Bayesian approach with each other and with LOGIST. The results are presented for 
the mathematics content area since it was the most complex to scale because of its seven forms. Validity 
for the three approaches to IRT scaling as well as for the content areas themselves is defined here in terms 
of the pattern of correlations between their IRT scores and relevant outside process and demographic 
variables. In the end longitudinal studies that emphasize policy decisions must concern themselves with 
describing the extent of the relationship between student performance and school and home-based learning 
experiences. 

Table 5.2 
Intercorrelations of Content Areas 
Within and Across Administrations 

            READ   MATH    SCI   HIST   READ   MATH    SCI   HIST   READ   MATH    SCI   HIST 
              BY     BY     BY     BY     F1     F1     F1     F1     F2     F2     F2     F2 
READ BY     1.00 
MATH BY     0.71   1.00 
SCI BY      0.71   0.73   1.00 
HIST BY     0.73   0.69   0.73   1.00 
READ F1     0.80   0.69    [?]    [?]   1.00 
MATH F1     0.69    [?]    [?]    [?]    [?]   1.00 
SCI F1      0.66    [?]    [?]    [?]    [?]    [?]   1.00 
HIST F1     0.67   0.65   0.68   0.76   0.75   0.72   0.77   1.00 
READ F2     0.74   0.65   0.64   0.66   0.82   0.71   0.69   0.70   1.00 
MATH F2     0.66   0.83   0.68   0.65   0.73   0.92   0.77   0.70   0.74   1.00 
SCI F2      0.63   0.70   0.71   0.65   0.69   0.75   0.80   0.70   0.73   0.79   1.00 
HIST F2     0.66   0.64   0.66   0.71   0.71   0.69   0.72   0.78   0.75   0.73   0.77   1.00 

[?] = entry illegible in the source. 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

One of the concerns outlined above in the preceding scaling chapter was the potential for LOGIST 
estimates to have ceiling effects for high scoring tenth grade students. Such students would not have any 
"room" to gain between the tenth and twelfth grades. We would expect that such limiting effects if they 
exist would show up when groups of advanced students were compared with groups of students who are 
less advanced. For example, one might get an underestimate of differences in gains between the students 
who take advanced mathematics courses versus those who do not. Part of this underestimate may be 
attributable to the fact that LOGIST procedures have no systematic way to deal with ceiling and near 
ceiling effects for high scoring students on the base year and first follow-up tests. 

Tables 5.3, 5.4 and 5.5 present correlations of gains and selected background and process variables. 
Gains are shown in the Theta and "true" score metric for the 8-10, 10-12, and the 8-12 (total gain) periods for 
LOGIST estimates and for two kinds of Bayesian approaches (ST1 and ST4). In addition, grade 8 to 12 
gains in proficiency probabilities at each of the five mathematics proficiency levels are also correlated with 
background and process variables. As indicated in Chapter 4, the proficiency probabilities are simply the 
probability that a given individual has "mastered" the skills defined by the items marking each of the 
proficiency levels. Like any score, these probabilities can be monitored for gains taking place at any one 
of the five proficiency levels. The Theta metric and the "true" score metric are also discussed in Chapter 4. 
The two kinds of Bayesian procedures differ in whether they use a normal prior (ST1) or a distribution 
free prior (ST4). 

# Units 

0.25 
0.18 
0.09 
0.21 
0.10 
0.20 
0.35 
0.22 

TRUE SCORE METRIC 

0.22 
0.27 
0.26 
0.15 
0.19 
0.20 
0.31 
0.36 
0.35 

GAIN IN THETA METRIC 
GAIN 8-10 LOG 
GAIN 8-10 ST1 
GAIN 8-10 ST4 
GAIN 10-12 LOG 
GAIN 10-12 ST1 
GAIN 10-12 ST4 
TOTAL GAIN LOG 
TOTAL GAIN ST1 
TOTAL GAIN ST4 

GAIN IN TRUE SCORE METRIC 
GAIN 8-10 LOG 
GAIN 8-10 ST1 
GAIN 8-10 ST4 
GAIN 10-12 LOG 
GAIN 10-12 ST1 
GAIN 10-12 ST4 
TOTAL GAIN LOG 
TOTAL GAIN ST1 
TOTAL GAIN ST4 

Table 5.5 

Evaluation of Alternative Scoring Procedures for Grade 8-10-12 Math 
CORRELATIONS OF GAIN WITH INITIAL (GRADE 8) STATUS 
3 METHODS: "LOG" = LOGIST; "ST1" = NALS 1-STEP METHOD; "ST4" = NAEP 4-STEP METHOD 

                                 THETA METRIC               TRUE SCORE METRIC 
                         TH8 LOG   TH8 ST1   TH8 ST4   NR8 LOG   NR8 ST1   NR8 ST4 
GAIN IN THETA METRIC 
GAIN 8-10 LOG            -0.2977   -0.1737   -0.1800   -0.1794   -0.1458   -0.1418 
GAIN 8-10 ST1            -0.0465   -0.0106   -0.0080   -0.0171   -0.0043   -0.0076 
GAIN 8-10 ST4            -0.1816   -0.1630   -0.1595   -0.1796   -0.1674   -0.1763 
GAIN 10-12 LOG           -0.0074    0.0013   -0.0004    0.0043    0.0061    0.0070 
GAIN 10-12 ST1            0.0520    0.0563    0.0512    0.0669    0.0634    0.0696 
GAIN 10-12 ST4           -0.1164   -0.1115   -0.1194   -0.0935   -0.0960   -0.0855 
TOTAL GAIN LOG           -0.2957   -0.1680   -0.1754   -0.1710   -0.1368   -0.1322 
TOTAL GAIN ST1            0.0000    0.0321    0.0305    0.0345    0.0422    0.0441 
TOTAL GAIN ST4           -0.2403   -0.2207   -0.2234   -0.2221   -0.2134   -0.2135 

GAIN IN TRUE SCORE METRIC 
GAIN 8-10 LOG            -0.1147   -0.0742   -0.0667   -0.0998   -0.0795   -0.0901 
GAIN 8-10 ST1             0.0116    0.0274    0.0379    0.0040    0.0158    0.0036 
GAIN 8-10 ST4             0.0071    0.0188    0.0323   -0.0170   -0.0020   -0.0217 
GAIN 10-12 LOG            0.0182    0.0166    0.0189    0.0126    0.0135    0.0106 
GAIN 10-12 ST1            0.0046   -0.0018   -0.0020   -0.0012   -0.0039   -0.0030 
GAIN 10-12 ST4            0.0048   -0.0004   -0.0007    0.0005   -0.0015   -0.0005 
TOTAL GAIN LOG           -0.0872   -0.0526   -0.0441   -0.0784   -0.0597   -0.0714 
TOTAL GAIN ST1            0.0128    0.0212    0.0297    0.0024    0.0103    0.0008 
TOTAL GAIN ST4            0.0091    0.0153    0.0262   -0.0137   -0.0026   -0.0183 

Table 5.5 (cont'd) 
Evaluation of Alternative Scoring Procedures for Grade 8-10-12 Math 
CORRELATIONS OF GAIN WITH INITIAL (GRADE 8) STATUS 
3 METHODS: "LOG" = LOGIST; "ST1" = NALS 1-STEP METHOD; "ST4" = NAEP 4-STEP METHOD 

                                 THETA METRIC               TRUE SCORE METRIC 
                         TH8 LOG   TH8 ST1   TH8 ST4   NR8 LOG   NR8 ST1   NR8 ST4 
GAIN IN PROFICIENCY PROBABILITY (8-12)* 
GAIN: LEVEL 1 LOG        -0.5979   -0.5595   -0.5856   -0.5067   -0.5025   -0.4700 
GAIN: LEVEL 1 ST1        -0.6479   -0.6560   -0.6831   -0.6061   -0.6123   -0.5837 
GAIN: LEVEL 1 ST4        -0.6611   -0.6158   -0.6447   -0.5545   -0.5515   -0.5159 
GAIN: LEVEL 2 LOG        -0.4948   -0.5704   -0.5768   -0.5715   -0.5877   -0.5868 
GAIN: LEVEL 2 ST1        -0.4461   -0.5355   -0.5330   -0.5520   -0.5703   -0.5772 
GAIN: LEVEL 2 ST4        -0.5419   -0.6181   -0.6294   -0.6128   -0.6299   -0.6264 
GAIN: LEVEL 3 LOG        -0.0601   -0.0992   -0.0652   -0.1509   -0.1475   -0.1717 
GAIN: LEVEL 3 ST1        -0.0724   -0.1173   -0.0317   -0.1710   -0.1694   -0.1939 
GAIN: LEVEL 3 ST4        -0.1353   -0.1921   -0.1588   -0.2458   -0.2472   -0.2721 
GAIN: LEVEL 4 LOG         0.3666    0.4370    0.4470    0.4154    0.4448    0.4277 
GAIN: LEVEL 4 ST1         0.3263    0.3846    0.4016    0.3567    0.3848    0.3652 
GAIN: LEVEL 4 ST4         0.4002    0.4752    0.4843    0.4535    0.4835    0.4662 
GAIN: LEVEL 5 LOG         0.4470    0.5406    0.5240    0.5449    0.5659    0.5669 
GAIN: LEVEL 5 ST1         0.5232    0.6209    0.6065    0.6256    0.6484    0.6473 
GAIN: LEVEL 5 ST4         0.5044    0.5809    0.5611    0.5967    0.6054    0.6139 

GRADE 12 THETA AND TRUE SCORE 
GR12 THETA LOG            0.7593    0.8038    0.8017    0.7990    0.8020    0.7976 
GR12 THETA ST1            0.7902    0.8440    0.8412    0.8390    0.8445    0.8397 
GR12 THETA ST4            0.7855    0.8339    0.8346    0.8221    0.8284    0.8200 
GR12 TRUE SCORE LOG       0.7700    0.8241    0.8238    0.8157    0.8229    0.8162 
GR12 TRUE SCORE ST1       0.7850    0.8414    0.8407    0.8327    0.8406    0.8337 
GR12 TRUE SCORE ST4       0.7864    0.8431    0.8423    0.8347    0.8424    0.8356 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education 
Statistics. 

Inspection of Tables 5.3, 5.4 and 5.5 indicates that in the Theta metric the normal prior Bayesian 
procedure (ST1) shows stronger relationships between gains and virtually all the process/demographic 
variables than do the other two procedures. The differences in favor of ST1 are particularly strong where 
contrasts are being made between groups quite different in their mathematics preparation, e.g., the 
relationship between being in the academic curriculum or "taking math now" and total gain. 

When the correlations are based on the "true" score metric the ST1 Bayesian approach still does 
as well as or better than the other two approaches. The "true" score metric is a non-linear transformation 
of the Theta scores and, unlike the Thetas, does not stretch out the tails of the score distribution as 
much. The stretching out at the tails has little impact on most analyses unless one is 
contrasting groups whose scores put them in or near the tails of the distribution. 
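
The non-linear relationship between the two metrics can be illustrated with a short Python sketch. It assumes that the "true" score is the IRT-estimated number right, that is, the sum over items of the probability of a correct response at a given Theta, and that the items follow a three-parameter logistic model with scaling constant D = 1.7; these are assumptions made for the illustration. The item parameters are mathematics items 8, 9 and 11 from Appendix E-2, and the function names are hypothetical.

```python
import numpy as np

D = 1.7  # assumed logistic scaling constant


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))


def irt_number_right(theta, items):
    """Expected number right at theta: the sum of the item probabilities."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)


# (A, B, C) values for three mathematics items listed in Appendix E-2.
items = [
    (1.01292, -0.47088, 0.24387),
    (1.12383, -0.46246, 0.35119),
    (1.29364, -0.53688, 0.21087),
]

# The same one-unit gain in Theta, taken in the middle and in the upper tail.
middle_gain = irt_number_right(0.5, items) - irt_number_right(-0.5, items)
tail_gain = irt_number_right(3.5, items) - irt_number_right(2.5, items)
print(middle_gain, tail_gain)
```

A one-unit gain in Theta near the item difficulties translates into a much larger number-right gain than the same Theta gain in the upper tail, which is why the two metrics can rank group differences in gain differently when the groups sit near the tails.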

The proficiency probabilities recorded in Tables 5.3, 5.4 and 5.5 demonstrate the importance of 
relating specific processes with changes taking place at appropriate points along the score distribution. 
These proficiency probabilities were defined in more detail in Chapter 4. Inspection of Table 5.4 indicates 
that gains between 8th and 12th grade in the probability of being proficient at level four (GPL4) show a 
positive correlation of .44 with number of units of mathematics. The correlations between gains in 
probability of mastery and various course exposures vary some by estimation method, but in general the 
one-step Bayesian procedure does as well as the other methods. One of the primary purposes of the 
proficiency levels is to provide information for each individual on where on the scale his or her changes 
are taking place. For example, an individual who had a high scale score (on the Theta or "true" score 
scale) in tenth grade and then received an even higher score in the twelfth grade would show his or her 
greatest gains in probability of mastery at either level 4 or level 5, the levels that mark the upper end of the 
scale. 

When the "dummy" variable contrasting whether an individual is in the academic curriculum, 
coded "1", versus the general/vocational curriculum, coded "0", is correlated with gains in probabilities at 
the various proficiency levels, one observes negative correlations for demonstrated proficiencies at the two 
lower levels (simple operations and fractions and decimals) and increasingly higher positive correlations 
for levels 3 through 5. That is, individuals with a score of "1" on the dummy variable, indicating they are 
in the academic curriculum, are making progressively greater gains in probabilities associated with mastery 
of levels 3 through 5. Conversely, individuals who are coded "0", indicating that they are in the 
general/vocational curriculum, are making their greatest gains in the two lower levels (simple operations 
and decimals/fractions). These general/vocational students' gains are typically taking place at the lower 
end of the scale and thus the negative correlation in the last column of Table 5.3. They are increasing 
their probabilities of proficiency primarily at the two lowest levels. 

Tables 5.6-5.11 present similar correlations for reading, science, and H/C/G respectively. The 
ST1 procedure was selected on the basis of the math test results, so only ST1 estimates were computed 
for these content areas. 

Table 5.7 
Correlations of Transcript Variables 
with Second Follow-up Status and Gains 
Reading 

                                            Total #    Average 
                                             Units      Grades 
Second Follow-up Status 
  IRT Number Right                            0.26       0.52 
  Standardized Theta                          0.26       0.53 
  Proficiency Level 1                         0.16       0.22 
  Proficiency Level 2                         0.25       0.49 
  Proficiency Level 3                         0.17       0.45 

Gain: Base Year to First Follow-up 
  IRT Number Right                            0.13       0.16 
  Standardized Theta                          0.13       0.18 
  Proficiency Level 1                         0.00      -0.06 
  Proficiency Level 2                         0.11       0.10 
  Proficiency Level 3                         0.12       0.30 

Gain: First to Second Follow-up 
  IRT Number Right                            0.00      -0.01 
  Standardized Theta                          0.00       0.02 
  Proficiency Level 1                        -0.06      -0.07 
  Proficiency Level 2                         0.00      -0.06 
  Proficiency Level 3                         0.06       0.14 

Total Gain: Base Year to Second Follow-up 
  IRT Number Right                            0.11       0.13 
  Standardized Theta                          0.12       0.18 
  Proficiency Level 1                        -0.05      -0.11 
  Proficiency Level 2                         0.09       0.03 
  Proficiency Level 3                         0.16       0.38 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

Table 5.9 
Correlations of Transcript Variables 
with Second Follow-up Status and Gains 
Science 

                                                      Number of Units 
                         Total #   Average     Earth 
                          Units     Grade     Science   Biology  Chemistry  Physics    Other 
Second Follow-up Status 
  IRT Number Right         0.44      0.48       0.02      0.22      0.43      0.43     -0.16 
  Standardized Theta       0.43      0.48       0.01      0.22      0.43      0.43     -0.16 
  Proficiency Level 1      0.25      0.27       0.03      0.16      0.25      0.20     -0.09 
  Proficiency Level 2      0.41      0.45       0.02      0.22      0.41      0.39     -0.15 
  Proficiency Level 3      0.38      0.44      -0.01      0.15      0.38      0.43     -0.13 

Gain: Base Year to First Follow-up 
  IRT Number Right         0.21      0.21       0.02      0.10      0.18      0.19     -0.04 
  Standardized Theta       0.20      0.19       0.02      0.08      0.16      0.18     -0.03 
  Proficiency Level 1     -0.04     -0.10       0.03     -0.02     -0.05     -0.07      [?] 
  Proficiency Level 2      0.20      0.21       0.02      0.11      0.17      0.14      [?] 
  Proficiency Level 3      0.28      0.32      -0.03      0.10      0.25      0.36      [?] 

Gain: First to Second Follow-up 
  IRT Number Right         0.01     -0.01       0.00      0.00      0.02      0.00      [?] 
  Standardized Theta       0.01      0.00       0.00     -0.01      0.02      0.01      0.00 
  Proficiency Level 1     -0.09     -0.10      -0.01     -0.05     -0.11     -0.07      0.06 
  Proficiency Level 2     -0.02     -0.05       0.00      0.00      0.01     -0.03     -0.01 
  Proficiency Level 3      0.15      0.16       0.01      0.06      0.17      0.11     -0.04 

Total Gain: Base Year to Second Follow-up 
  IRT Number Right         0.21      0.20       0.02      0.09      0.19      0.19     -0.04 
  Standardized Theta       0.20      0.19       0.02      0.07      0.17      0.19     -0.04 
  Proficiency Level 1     -0.12     -0.18       0.02     -0.07     -0.14     -0.13      0.07 
  Proficiency Level 2      0.17      0.15       0.02      0.11      0.17      0.11     -0.04 
  Proficiency Level 3      0.35      0.39      -0.01      0.14      0.34      0.38     -0.11 

[?] = entry illegible in the source. 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

Table 5.11 
Correlations of Transcript Variables 
with Second Follow-up Status and Gains 
History/Citizenship/Geography 

                                                                   Number of Units 
                                            Total #    Average 
                                             Units      Grade     History     Other 
Second Follow-up Status 
  IRT Number Right                            0.25       0.55       0.24       0.11 
  Standardized Theta                          0.25       0.54       0.24       0.11 

Gain: Base Year to First Follow-up 
  IRT Number Right                            0.11       0.14       0.08       0.06 
  Standardized Theta                          0.09       0.10       0.06       0.06 

Gain: First to Second Follow-up 
  IRT Number Right                            0.02       0.07       0.00       0.03 
  Standardized Theta                          0.01       0.06      -0.01       0.02 

Total Gain: Base Year to Second Follow-up 
  IRT Number Right                            0.11       0.18       0.06       0.08 
  Standardized Theta                          0.09       0.14       0.05       0.07 

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for 
Education Statistics. 

The reader should note that the column labeled "total units" refers to the total number of semesters of 
mathematics, English, science or social studies courses taken, depending on the content area being analyzed. 
As in the case of mathematics, the pattern of the total score gains and the proficiency probability gains 
was consistent with our theoretical expectations. That is, the aggregate (total) score gains show the 
expected patterns of overall gain, while gains in proficiency probabilities show maximum relationships with 
school processes that target learning that is appropriate for that particular mastery level. 

Appendix E-2 
Test Item Map 
Math 

Item  Answer  # Valid       Item Number in Booklet                 IRT Parameters 
      Key     Choices   88  90L  90M  90H  92L  92M  92H         A         B         C 
   1  4(D)          4   28   29   23   19   30   19    -   0.68181  -0.87241   0.11087 
   2  2(B)          4    -    -    -    -   26    -    -   0.81955  -0.76121   0.17258 
   3  4(D)          5   21   22    -   16   22    -    -   0.59218  -1.64137   0.00000 
   4  1(A)          4    -   40    -    -   17    -    -   0.80777  -2.94873   0.06710 
   5  4(D)          5   29   30   24   20    -    -    -   0.79283  -0.66171   0.08814 
   6  3(C)          4   31   32   26    -   28    -    -   0.83407  -1.08544   0.09471 
   7  2(B)          5   25   26    -    -   24    -    -   0.89889  -1.10120   0.15730 
   8  2(B)          4   34   34   28   23   29    -    -   1.01292  -0.47088   0.24387 
   9  3(C)          4   26   27   22   18   23   17    -   1.12383  -0.46246   0.35119 
  10  3(C)          4   32   33    -    -    -    -    -   0.87113  -0.74347   0.35651 
  11  2(B)          4    5    3    5    4    9    4    -   1.29364  -0.53688   0.21087 
  12  4(D)          4    4    2    4    3   10    6    -   1.19470  -0.33819   0.20949 
  13  2(B)          4    9    4    9    8   11    -    -   1.01044   0.09795   0.23418 
  14  1(A)          4    -    7    -    -    2    -    -   0.71930  -2.22133   0.00000 
  15  4(D)          4    7    -    7    6    -    -    -   1.07586  -0.11721   0.11326 
  16  3(C)          4   12   11   12   11    -    -    -   0.79942  -0.40340   0.05706 
  17  1(A)          4    2    -    2    1    -    -    -   0.60453  -0.53500   0.07134 
  18  1(A)          4    3    -    3    2    -    -    -   0.92699   0.95693   0.40262 
  19  1(A)          4    -    8    -    -    -    -    -   1.24943   0.01075   0.19848 
  20  3(C)          4    -    9    -    -    -    -    -   1.40404  -0.05373   0.21384 
  21  1(A)          4    -    6    -    -    8    -    -   0.56981  -0.92211   0.19984 
  22  2(B)          4   13   12   13   12   12    9    -   0.88153  -0.60426   0.09364 
  23  4(D)          4   10    5   10    9   15   11    -   0.96547   0.04512   0.17120 
  24  2(B)          4    6    -    6    5    -   12    2   1.00754   0.45108   0.30110 
  25  2(B)          4    8    -    8    7    -   13    3   0.68957   0.27051   0.09071 
  26  1(A)          4   11   10   11   10   16   10    1   0.82091   0.11529   0.11306 
  27  1(A)          4    -    -    -    -    -    -    4   0.98903   2.29678   0.11834 
  28  1(A)          4   14   13   14   13   14    7    -   1.06022  -0.32865   0.14891 
  29  1(A)          4   15   14    -   14    7    -    -   0.99843  -0.61601   0.43884 
  30  2(B)          4   16   15   15    -    3    3    -   0.54766  -2.19425   0.00000 
  31  2(B)          4   17   16   16    -    5    5    -   0.54485  -0.76427   0.38465 
  32  2(B)          4   18   17   17   15   13    8    -   1.15688  -0.26050   0.21053 
  33  2(B)          4   19   18   18    -    1    1    -   0.68679  -2.21344   0.03540 
  34  3(C)          5   33    -   27   22   34   24    -   0.54566   0.93151   0.32992 
  35  2(B)          4   24   25   21   17   27   16    -   0.57035  -1.18917   0.02352 
  36  4(D)          4   30   31   25   21   31   21    8   0.58607  -0.41898   0.13473 
  37  2(B)          4   39   38   33   28   40   23   10   1.30207   0.06324   0.12511 
  38  4(D)          4   37    -   31   26    -    -    -   0.83285  -0.59678   0.00000 
  39  4(D)          5   40   39   34   29   33   18    6   1.08731  -0.19037   0.11735 
  40  2(B)          4   38   37   32   27    -   27   13   1.36826   1.29155   0.34865 
  41  2(B)          4    -    -    -    -    -   34   26   1.14429   2.25687   0.25864 
  42  5(E)          5    -    -    -    -    -    -   29   0.69035   1.26821   0.00000 
  43  3(C)          4    -    -    -   30    -   38   32   0.64398   2.41658   0.12428 
  44  4(D)          4   36   36   30   25   36   20    7   0.92334   0.01612   0.12642 
  45  3(C)          5    -    -    -    -   38   36   22   0.60561   2.27172   0.22935 
  46  3(C)          4    -    -    -   31    -    -   23   1.12318   1.40632   0.22014 
  47  3(C)          4    -    -    -   32    -    -   19   0.67679   2.00317   0.25383 
  48  3(C)          4    -    -    -    -    -    -   28   1.48766   2.12629   0.19798 
  49  2(B)        [?]    -    -    -   33    -    -    9   2.14550   1.07065   0.34743 
  50  3(C)          4   35   35   29   24   25   22    -   0.60185  -0.22727   0.26618 
  51  3(C)          3    -    -   35   34    -   25   12   0.83282   0.13847   0.10066 
  52  1(A)          4    -    -    -   35    -    -   20   1.36009   1.15455   0.06559 
  53  4(D)          5    -    -   36   36    -    -    -   0.59898  -0.46164   0.04239 
  54  3(C)          5    -    -   37   37    -   28   11   1.41513   1.01649   0.24226 
  55  1(A)          5    -    -   38   38    -   30   18   0.95161   1.01715   0.20330 

A dash indicates that the item did not appear in that booklet; [?] = entry illegible in the source. 
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Appendix E-2 
Test Item Map 
Math (Continued) 



56 
57 
58 
59 
60 
61 
62 
63 
64 
65 
66 
67 
68 
69 
70 
71 
72 
73 
74 
75 
76 
77 
78 
79 
80 
81 



Answer 
Key 
3(C) 
1(A) 
5(E) 
2(B) 
1(A) 
4(D) 
1(A) 
3(C) 
3(C) 
2(B) 
3(C) 
5(E) 
5(E) 
4(D) 
4(D) 
1(A) 
3(C) 
5(E) 
4(D) 
1(A) 
4(D) 
3(C) 
1(A) 
4(D) 
5(E) 
4(D) 



# Valid 
Choices 

5 

5 

5 

5 



5 
5 
4 
5 
5 
5 
5 
5 
4 
4 
5 
4 
5 
5 
5 



Item Number in 


Booklet 




IRT Parameters 




88 


90L 90M 


90H 


92L 


92M 


92H 




A 




B 




C 




39 


39 




32 


24 


0. 


73958 


1. 


25686 


0. 


16181 




40 


40 




31 


17 


0. 


85972 


0. 


85092 


0. 


10950 










40 


40 


1. 


33843 


2. 


81896 


0. 


04093 










39 


37 


1. 


31305 


2. 


77701 


0. 


15386 


1 


1 1 




6 


2 




1. 


13553 


-1. 


31660 


0. 


20392 


20 


21 19 




18 


14 




0. 


75484 


-2. 


25518 


0. 


00000 


22 


23 




19 






0. 


90953 


-1. 


58401 


0. 


00000 


23 


24 20 




20 


15 




0. 


41684 


-1. 


58628 


0. 


00000 


27 


28 




32 






1. 


55719 


-0. 


74660 


0. 


16430 




19 










1 


11627 


-0 


00395 


0. 


16357 




20 




4 






0 


86183 


-1 


94097 


0 


00000 








21 




5 


0 


52694 


-1 


59965 


0 


00000 








35 




15 


1 


14276 


0 


46401 


0 


08410 








37 


35 


21 


0 


54005 


1 


35221 


0 


18907 








39 


26 


14 


0 


83555 


0 


50640 


0 


09662 










29 


16 


0 


68308 


2 


47157 


0 


40168 










33 


25 


0 


.98551 


2 


.01246 


0 


29597 










37 


27 


0 


.96775 


1 


.59789 


0 


.08675 












30 


0 


.68921 


2 


.77731 


0 


.22115 












31 


1 


.01358 


1 


.82906 


0 


. 14133 












33 


1 


.59430 


2 


.11449 


0 


.12061 












34 


1 


.31935 


2 


.29660 


0 


.14979 












35 


1 


.07980 


3 


.20302 


0 


.11385 












36 


0 


.89043 


2 


.91767 


0 


.12718 












38 


1 


.29152 


2 


.56220 


0 


.05966 












39 


1 


.49669 


2 


.66925 


0 


. 11299 



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education Statistics. 






Psychometric Report for the NELS:88 
Base Year Through Second Follow-Up 



Appendix E-3 
Test Item Map 
Science 



      Answer  # Valid   Item Number
                        in Booklet       IRT Parameters
Item  Key     Choices   88   90   92     A          B        C
1     3(C)    4         1    -    -      1.16608   -0.67228  0.37787
2     5(E)    5         2    -    -      0.59777   -1.93399  0.13876
3     1(A)    4         3    2    -      0.69979   -0.57676  0.33921
4     3(C)    4         4    3    5      0.66591   -0.62182  0.36695
5     5(E)    5         5    4    2      1.09400   -1.36000  0.00000
6     5(E)    5         6    5    1      1.04363   -1.55512  0.00002
7     1(A)    4         7    -    -      0.52146   -1.29720  0.00000
8     1(A)    4         8    -    -      0.62419   -0.25581  0.25386
9     2(B)    5         9    -    -      0.53319   -1.36224  0.00001
10    3(C)    4         10   1    8      1.10474    0.00281  0.30008
11    3(C)    4         11   -    -      0.43784    0.20647  0.19275
12    3(C)    5         12   6    6      0.85169   -0.65205  0.27561
13    4(D)    4         13   -    -      0.60663   -1.75538  0.00001
14    3(C)    5         14   7    3      1.23878   -0.41510  0.19739
15    1(A)    4         15   8    15     0.40637   -0.28296  0.00001
16    3(C)    4         16   9    18     0.95246    0.47833  0.33145
17    2(B)    4         17   10   7      1.28611    0.12036  0.25544
18    2(B)    4         18   11   9      0.97920    0.00387  0.22460
19    3(C)    4         19   12   14     1.01363    0.24806  0.24407
20    2(B)    4         20   13   -      1.15653    0.74217  0.33252
21    3(C)    4         21   14   -      0.96782    0.61829  0.31361
22    4(D)    4         22   15   16     0.67782    0.90750  0.25591
23    3(C)    4         23   16   -      1.43791    1.05388  0.38865
24    1(A)    5         24   17   20     0.62227    0.20736  0.00001
25    4(D)    5         -    18   -      0.64546    1.18072  0.09492
26    3(C)    4         -    20   -      0.88578    0.01877  0.16607
27    4(D)    4         -    19   21     1.46803    0.99365  0.13903
28    1(A)    4         -    -    4      0.70864   -0.36201  0.34331
29    1(A)    4         -    21   12     1.09783    0.18743  0.17761
30    2(B)    5         -    22   13     0.80216    0.27046  0.21798
31    4(D)    4         -    -    10     0.37842   -0.57463  0.00001
32    1(A)    4         -    23   22     1.43394    0.96323  0.12356
33    4(D)    4         -    24   11     0.80165   -0.32345  0.10520
34    1(A)    4         -    25   -      0.32691    0.10811  0.00000
35    1(A)    4         -    -    17     1.04588    0.81089  0.21361
36    2(B)    4         -    -    23     0.71678    1.76348  0.32502
37    1(A)    4         -    -    24     0.81268    2.18077  0.23181
38    4(D)    4         -    -    25     1.54588    2.40482  0.10371

Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education Statistics. 
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Appendix E-4 
Test Item Map 
History/Citizenship/Geography 



1 
2 
3 
4 
5 
6 
7 
8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
43 
44 
45 
46 
47 



Answer 
Key 

3(C) 
3(C) 
2(B) 
1(A) 
1(A) 
2(B) 
4(D) 
4(D) 
3(C) 
5(B) 
2(B) 
2(B) 
3(C) 
2(B) 
4(D) 
3(C) 
2(B) 
2(B) 
1(A) 
3(C) 
1(A) 
1(A) 
1(A) 
2(B) 
1(A) 
2(B) 
4(D) 
2(B) 
2(B) 
3(C) 
4(D) 
1(A) 
2(B) 
2(B) 
2(B) 
1(A) 
1(A) 
4(D) 
2(B) 
3(C) 
1(A) 
3(C) 
4(D) 
2(B) 
3(C) 
2(B) 
1(A) 



# Valid 
Choices 



4 

4 

4 

4 

4 

4 

4 

5 

5 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

2 

2 

2 

2 

2 

5 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 

4 



Item Number in Booklet 



IRT Parameters 



88 
4 

26 

22 
12 
28 
2 
13 
14 
15 
16 

23 
18 
20 
3 
1 
30 
17 

29 
5 
6 
7 
8 
9 

19 

21 
10 
24 

25 
11 



27 



90 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 

12 
13 
14 
15 
16 
17 
18 

19 



20 

21 
22 
23 
24 
25 
26 
27 

28 
29 
30 



92 
2 
14 



6 
18 

3 
10 
12 
13 
26 
11 



25 
1 
22 
16 



B 



21 



5 
23 
9 



29 
15 



17 

8 
19 
20 
24 
27 
28 
30 



Item  A          B        C
1     0.98219   -1.25256  0.21137
2     1.12623    0.00140  0.28845
3     0.29554   -1.37111  0.00000
4     1.45953   -0.02180  0.26657
5     0.57016   -0.93455  0.02822
6     1.52760    0.44390  0.27880
7     1.10537   -1.33515  0.26274
8     1.36141   -0.26818  0.32572
9     0.75018    0.47592  0.25624
10    1.02945    0.02726  0.18382
11    1.24221    0.56911  0.29637
12    1.48652    1.48763  0.29832
13    0.93498    0.28607  0.29308
14    0.87587   -1.26965  0.33294
15    0.71144   -1.13364  0.08806
16    2.03444   -1.52077  0.46357
17    1.07288   -1.08690  0.48813
18    1.88350    0.75941  0.19735
19    1.00430   -1.84445  0.27435
20    1.30349    1.25515  0.26184
21    1.35758    0.50549  0.23433
22    0.96925   -1.92663  0.23751
23    0.52152   -2.69376  0.00000
24    1.64167   -2.11534  0.00000
25    1.03994   -2.19188  0.00000
26    1.75480   -2.12320  0.00000
27    1.49480   -1.14670  0.24233
28    0.88606    0.99954  0.29325
29    1.20516   -0.62570  0.35219
30    1.10922   -0.44457  0.51625
31    0.84672   -0.60389  0.15013
32    0.63192    0.82388  0.07269
33    0.76584   -0.22218  0.21016
34    1.59962   -0.06140  0.30746
35    0.44765   -1.46990  0.00168
36    1.25594    2.25819  0.20646
37    0.90837   -0.30759  0.13674
38    0.93793    0.77969  0.28098
39    0.68855    1.62702  0.31263
40    1.15943    0.48314  0.32292
41    0.41296   -1.05935  0.00000
42    1.32067    0.75449  0.30523
43    0.97527    0.14559  0.21349
44    0.70172    0.80714  0.25314
45    1.11145    1.64311  0.15251
46    1.02496    1.71842  0.22389
47    1.28831    2.25424  0.15843



Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education Statistics. 
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Across Years 
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Appendix G: Test Information Function-Theta (Ability) 
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Appendix G 



Base Year Reading (One Form) 
Test Information Function 
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Base Year Math (One Form) 
Test Information Function 
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Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education 
Statistics. 
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Appendix G (Continued) 

Base Year Science (One Form) 
Test Information Function 
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Base Year History/Citizenship/Geography (One Form) 
Test Information Function 
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Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education 
Statistics. 
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Appendix G (Continued) 

First Followup Reading (Two Forms) 
Test Information Function 
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First Followup Math (Three Forms) 
Test Information Function 
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Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education 
Statistics. 






Appendix G (Continued) 



First Followup Science (One Form) 
Test Information Function 
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First Followup History/Citizenship/Geography (One Form) 
Test Information Function 
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Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education 
Statistics. 
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Appendix G (Continued) 



Second Followup Reading (Two Forms) 
Test Information Functions 
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Second Followup Math (Three Forms) 
Test Information Functions 
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Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education 
Statistics. 






Appendix G (Continued) 



Second Followup Science (One Form) 
Test Information Function 
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Second Followup History/Citizenship/Geography (One Form) 
Test Information Function 
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Source: National Education Longitudinal Study of 1988: Second Follow-Up, U.S. Department of Education, National Center for Education 
Statistics. 
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